Tariq Massaoudi

The Ultimate Guide to Rate Limiting: Algorithms, Use Cases, and Cloud Solutions

Mon, 26 May 2025 22:12:03 GMT

by ChatGPT

Introduction

When building an API or any system that handles large volumes of requests, one crucial challenge you’ll face is how to manage and control traffic. Enter rate limiting — the process that ensures your system doesn’t get overwhelmed by too many requests at once. Whether it’s to prevent abuse, ensure fairness, or just to keep things running smoothly, understanding the right way to implement rate limiting is essential. This article will walk you through the different types of rate limiters, their real-world applications, and how to design an effective one for your system.

How Rate Limiting Works and Why Use It

How Rate Limiting Works

Rate limiting typically involves tracking the number of requests a user or client makes within a specified time frame (like seconds, minutes, or hours). If the user exceeds the allowed number of requests, the system blocks or delays the excess requests until the next time window begins.

Here’s a simple flow of how it works:

Request is made: A user sends a request to the system.
Check request count: The system checks how many requests the user has made in the current time window.
Check against limit: If the user has made too many requests, the system responds with an error (commonly HTTP 429 — Too Many Requests). If the limit hasn’t been reached, the request is processed as usual.
Window resets: Once the time window expires, the request count is reset, and the user can make new requests within the limit.

How rate limiting works

Depending on the algorithm used, the method for counting and handling requests varies, but the basic principle remains the same.

Why Use Rate Limiting?

Prevent Overload:
Too many requests at once can overwhelm your servers, leading to crashes or degraded performance. By controlling the flow of traffic, rate limiting ensures that your system can handle the load without going down.
Fairness:
Without rate limiting, some users could hog resources, leaving others with a poor experience. By limiting the number of requests, you ensure that all users get a fair share of the system’s capacity.
Protect from Abuse:
Rate limiting helps prevent malicious users from exploiting your system. For example, a malicious actor could try to flood your API with requests to crash it or scrape sensitive data. Rate limiting ensures they can’t make too many requests in a short time.

Key Rate Limiting Algorithms

When choosing a rate limiter, the algorithm you pick depends on your use case. Each approach comes with its own advantages and trade-offs. Let’s take a look at the most common algorithms used in rate limiting, and when you might want to use them.

1. Token Bucket

The Token Bucket algorithm is one of the most flexible and widely used for rate limiting. It’s designed to allow for bursts of traffic while maintaining a steady flow of requests. Here’s how it works:

Parameters:

Bucket capacity: Maximum number of tokens the bucket can hold.
Token refill rate: Rate at which tokens are added to the bucket (e.g., 1 token per second).
Tokens per request: Number of tokens each request consumes (usually 1).

How it works: Tokens are generated at a fixed rate and placed into a bucket, up to its capacity. Each incoming request consumes a token. If there are tokens available, the request proceeds. If the bucket is empty, requests are delayed or blocked. The bucket capacity sets how large a burst the system will accept, while the refill rate caps the average request rate over time.
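The mechanics above can be sketched in a few lines of Python. This is a minimal, single-process illustration, not a production limiter: the class name `TokenBucket`, the `allow` method, and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class TokenBucket:
    """Token bucket sketch: capacity bounds bursts, refill_rate bounds the average rate."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity        # maximum number of tokens the bucket can hold
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock              # injectable so the behavior can be tested
        self.tokens = float(capacity)   # start full so an initial burst is allowed
        self.last_refill = clock()

    def _refill(self):
        # Add the tokens earned since the last call, never exceeding capacity.
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def allow(self, cost=1):
        """Consume `cost` tokens and return True, or return False if too few remain."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `TokenBucket(capacity=3, refill_rate=1.0)`, three back-to-back calls to `allow()` succeed and the fourth is rejected until roughly a second has passed and a new token has been added.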

Token bucket visualized

Why use it? The token bucket is perfect for situations where you need to handle bursts of traffic, like when users submit multiple requests within a short period. It allows for burst behavior but limits the overall rate over time.

Real-World Use Case:
Imagine an online ticketing platform during a flash sale. Users might attempt to book tickets in bulk within a few seconds, creating a surge in requests. The token bucket ensures that the platform can handle the initial burst of requests but throttles back once the tokens are exhausted, preventing overload.

2. Leaky Bucket

The Leaky Bucket algorithm is similar to the token bucket but with a key difference in how traffic is handled. While the token bucket allows bursts and smooths out traffic over time, the leaky bucket enforces a more rigid output rate.

Parameters:

Bucket capacity: Maximum number of requests the bucket can hold.
Leak rate: Fixed rate at which requests are processed (e.g., 10 requests per second).
Request arrival rate: Rate at which requests arrive at the system.

How it works: Requests are added to the bucket. If the bucket overflows (i.e., too many requests arrive), the excess requests are dropped. The leak rate controls how quickly requests are processed and ensures a smooth flow over time.
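As a rough sketch, the bucket can be modeled as a fill level that drains at the leak rate. Note the assumption: this version only tracks how full the bucket is (the "meter" variant) and does not hold or process the actual requests. The names `LeakyBucket` and `offer` are hypothetical.

```python
import time


class LeakyBucket:
    """Leaky bucket sketch that tracks the fill level instead of queuing real requests."""

    def __init__(self, capacity, leak_rate, clock=time.monotonic):
        self.capacity = capacity    # maximum number of requests the bucket can hold
        self.leak_rate = leak_rate  # requests drained (processed) per second
        self.clock = clock          # injectable so the behavior can be tested
        self.level = 0.0            # current fill level
        self.last_leak = clock()

    def _leak(self):
        # Drain the bucket at the fixed leak rate for the time that has passed.
        now = self.clock()
        drained = (now - self.last_leak) * self.leak_rate
        self.level = max(0.0, self.level - drained)
        self.last_leak = now

    def offer(self):
        """Return True if the request fits in the bucket, False if it overflows."""
        self._leak()
        if self.level + 1 > self.capacity:
            return False  # bucket would overflow: drop the request
        self.level += 1
        return True
```

A queue-based variant would store each accepted request and have a worker pull them out at the leak rate; the accept-or-drop decision stays the same.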

Leaky bucket visualized

Why use it? The leaky bucket is great when you want to maintain a steady, consistent rate of requests. It’s less flexible than the token bucket but can be ideal for systems that need to avoid sudden spikes in traffic.

Real-World Use Case:
Consider a live-streaming service where users upload video content. You don’t want the server to be overwhelmed with too many concurrent uploads, so you regulate the rate at which uploads are processed. This ensures that while multiple users can upload content, the server doesn’t get bogged down by too many uploads at once.

3. Fixed Window Counter

The Fixed Window Counter algorithm is the simplest form of rate limiting. It tracks the number of requests within a fixed time window, and if the number of requests exceeds the threshold, further requests are blocked until the next window starts.

Parameters:

Time window: The time frame in which requests are counted (e.g., 1 minute).
Max requests per window: The maximum number of requests allowed within the time window.

How it works: The system tracks the number of requests made within a fixed time window (e.g., 1 minute). If the number of requests exceeds the limit during that window, the system blocks further requests until the next time window begins.
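A minimal in-memory sketch of this counter might look like the following. The names (`FixedWindowCounter`, `allow`, `client_id`) and the injectable `clock` are assumptions for illustration; a shared deployment would typically keep the counters in an external store instead of a local dictionary.

```python
import time


class FixedWindowCounter:
    """Fixed window sketch: one counter per client, reset when a new window starts."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit                    # max requests per window
        self.window_seconds = window_seconds  # window length in seconds
        self.clock = clock                    # injectable so the behavior can be tested
        self.state = {}                       # client_id -> (window_index, count)

    def allow(self, client_id):
        # Windows are aligned to multiples of window_seconds on the clock.
        window_index = int(self.clock() // self.window_seconds)
        stored_index, count = self.state.get(client_id, (window_index, 0))
        if stored_index != window_index:
            count = 0  # a new window has started: reset the counter
        if count >= self.limit:
            self.state[client_id] = (window_index, count)
            return False
        self.state[client_id] = (window_index, count + 1)
        return True
```

Because the counter resets the instant the window index changes, a client that is blocked at second 59 is allowed again at second 60.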

Fixed window counter visualized

Why use it? This algorithm is ideal for applications where traffic is consistent and predictable. It’s simple and effective.

Cons:

One major downside of the Fixed Window Counter is that traffic can spike at the window boundaries. For example, with a limit of 100 requests per minute, a user could make 99 requests just before the window ends and another 99 immediately after it resets. All 198 requests are accepted within a very short time, nearly double the intended quota, which can put unexpected load on the system.

Real-World Use Case:
Think of a public API for checking stock prices. Each user is allowed 100 requests per minute. If a user exceeds this limit, they can’t make further requests until the next minute. The fixed window is perfect for this case, where users are making regular requests at a steady rate.

4. Sliding Window Log

The Sliding Window Log algorithm provides more precision by tracking individual request timestamps within a sliding window. Because the window moves with each request rather than resetting at fixed boundaries, the limit holds over any interval of the window’s length, which avoids the boundary bursts of the fixed window counter.

Parameters:

Time window: The length of the sliding window (e.g., 1 minute).
Max requests: The maximum number of requests allowed within the window.
Request timestamps: Track the exact time each request was made.

How it works: Requests are timestamped as they come in. For each new request, the system discards timestamps older than the window (e.g., the last 1 minute) and counts the ones that remain. If that count has already reached the limit, the request is blocked or delayed.
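Here is a small sketch of the log-based approach for a single client. The class name `SlidingWindowLog`, the `allow` method, and the injectable `clock` are assumptions for the example.

```python
import time
from collections import deque


class SlidingWindowLog:
    """Sliding window log sketch: keeps one timestamp per accepted request."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit                    # max requests within any window
        self.window_seconds = window_seconds  # length of the sliding window
        self.clock = clock                    # injectable so the behavior can be tested
        self.log = deque()                    # timestamps of accepted requests, oldest first

    def allow(self):
        now = self.clock()
        cutoff = now - self.window_seconds
        # Drop timestamps that have slid out of the window.
        while self.log and self.log[0] <= cutoff:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True
```

The cost of this precision is memory: the log holds up to `limit` timestamps per client, whereas a fixed window counter holds a single integer.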

Sliding window log visualized

Why use it? This algorithm is ideal when you need an accurate limit over a rolling period. It enforces the quota across any interval of the window’s length, so there is no burst at window boundaries. The trade-off is memory, since it stores one timestamp per request.

Real-World Use Case:
A mobile banking app allows users to make 10 transactions per day. With the sliding window log, the system ensures that the user doesn’t exceed 10 transactions in any rolling 24-hour period, rather than resetting the count at midnight.

Rate limiters in the cloud

If you’re working with cloud platforms, there’s no need to reinvent the wheel. Both AWS and Azure offer built-in rate limiting features that are easy to integrate and scale.

AWS API Gateway: AWS offers built-in throttling for APIs. You can set a steady-state rate (requests per second) and a burst limit, and use usage plans to apply throttling and longer-term quotas per API key. It also integrates with AWS Lambda for more advanced traffic management.
Azure API Management: Azure provides API Management, which allows you to enforce rate limits and quotas at the API level. You can define policies to throttle requests based on user or IP address, and scale these limits as needed.

Conclusion

To wrap things up, rate limiting is crucial for maintaining a smooth, fair, and secure system. Whether you’re dealing with burst traffic or protecting your backend from abuse, rate limiting helps you keep things under control. Of course, there are trade-offs; some algorithms are simpler but less flexible, while others offer more precision but come with added complexity. We’ve covered key algorithms like Token Bucket, Leaky Bucket, Fixed Window Counter, and Sliding Window Log, and seen how they fit different use cases. If you’re in the cloud, AWS API Gateway and Azure API Management offer powerful, managed solutions that take care of the heavy lifting. So, choose the right algorithm or service for your needs, and you’ll have a system that handles traffic efficiently and scales with ease. Thanks for reading, and I hope this article has given you the insights you need to tackle rate limiting in your projects.

The Definitive Guide to Choosing a Storage Solution: Matching Your Data to the Right Architecture

Fri, 09 May 2025 22:12:03 GMT

Introduction:

If you’re building any kind of system, whether it’s a web app, a big data analysis dashboard, or an enterprise backend, you’ll eventually face this question: Where do I store my data?

This article will be your practical guide for navigating that choice. It’ll walk you through how to determine the right storage solution based on your data type, access patterns, and use case. It will also highlight real-world tools from cloud providers like AWS, Azure, and others as examples of how you would implement each storage solution.

What’s the structure of your data?

The first and most critical question you must ask is what kind of data are we dealing with?

Structured Data: Well-defined rows and columns. Think of tables with strict schemas, e.g., customer info, product inventories, financial transactions.
Semi-structured Data: JSON, XML, YAML. Has structure but doesn’t fit neatly into columns.
Unstructured Data: Images, videos, audio files, documents. No inherent structure.

If You Have Structured Data:

Use Case: OLTP (Online Transaction Processing)

If you’re building a web application or API where users are constantly interacting with the system (logging in, creating accounts, placing orders, updating settings), then you’re working in OLTP territory. These are read/write-heavy operations that need to be fast, reliable, and consistent.

Think about apps like an e-commerce platform where someone adds items to their cart and checks out, or a social media platform where users update their profiles.

In this case, it’s recommended to use a relational database. These are great at enforcing structure (schemas), ensuring consistency (ACID compliant), and handling concurrent operations safely.

Common cloud solutions you’d use here:

AWS RDS (supports MySQL, PostgreSQL, etc.): great for managed production environments
Azure SQL Database: scalable and integrates well if you’re in the Microsoft ecosystem
Google Cloud SQL: pairs nicely with App Engine or GKE for app backends

Use Case: OLAP (Online Analytical Processing):

When your focus shifts from handling transactions to analyzing them — looking for trends, generating reports, building dashboards — you’re now in OLAP territory. This is where you’re less concerned with updating data and more focused on scanning large volumes of it quickly and efficiently.

Think about scenarios like: A product manager exploring daily sales by category over the past year or a dashboard that shows real-time KPIs across regions, products, and timeframes.

These use cases often involve aggregations, filters, and joins on massive datasets. The workloads are read-heavy, and they often run on scheduled pipelines or are triggered by end-user dashboards.

In this case, it’s recommended to use a columnar database. These are designed specifically for analytical queries — they store data by column rather than by row, which makes operations like filtering and aggregating much faster, especially when only a few fields are queried at a time.
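To make the row-versus-column distinction concrete, here is a toy sketch in plain Python. The sample orders and the function names (`to_columns`, `total_from_rows`, `total_from_columns`) are invented for illustration, and real columnar engines add compression and vectorized execution on top of this layout.

```python
# Row-oriented layout: each record is stored together, as in a typical OLTP table.
rows = [
    {"order_id": 1, "category": "books", "amount": 12.5},
    {"order_id": 2, "category": "games", "amount": 40.0},
    {"order_id": 3, "category": "books", "amount": 7.5},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_from_rows(records):
    # Must visit every full record, even though only one field is needed.
    return sum(record["amount"] for record in records)


def total_from_columns(columns):
    # Reads only the "amount" column and never touches the other fields.
    return sum(columns["amount"])


columns = to_columns(rows)
```

Both functions return the same total, but the columnar version only reads the one field the query asks for, which is the property analytical engines exploit at scale.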

Common cloud solutions you’d use here:

Amazon Redshift: a solid choice for batch-based analytics at scale
Google BigQuery: serverless, fast, and integrates well with other GCP tools like Dataflow or Looker
Azure Synapse Analytics: great if you’re already on Azure and want hybrid support for structured and semi-structured data

If You’re Dealing with Semi-Structured Data (JSON, XML, Logs)

Use case: In-memory caching:

Let’s say you’re storing session data, API tokens, or configuration values that need to be accessed frequently and with extremely low latency. In-memory caching is a classic solution for this, especially when you care more about speed than durability.

In this case, it’s recommended to use an in-memory key–value store. Tools like Redis or Memcached offer blazing-fast access and simple key-based lookup. They’re also widely supported across cloud providers: Amazon ElastiCache and Azure Cache for Redis make setup and scaling relatively seamless.

Use Case: Document-Oriented Data Access:

Consider an application where you’re storing complex, nested user profiles, product catalogs, or blog posts that vary in structure. Each object might have subfields, embedded lists, or optional sections. Trying to normalize this into relational tables would not only be tedious but also reduce flexibility and performance.

In this case, it’s recommended to use a document database. Systems like MongoDB Atlas or Amazon DocumentDB are purpose-built for storing and querying nested JSON objects. They support indexing on nested fields and let you query or update individual paths inside a document. If you’re already serverless or mobile-heavy, Firebase Firestore or Azure Cosmos DB might be a better fit.

Use Case: Relationship-based querying:

Now let’s imagine you’re working on a system where understanding relationships is key, maybe it’s a social network, a fraud detection tool, or a recommendation engine. Users expect the system to surface meaningful connections across entities: who knows whom, what interacts with what, or how things are related through multiple hops.

These workloads involve traversals, pattern matching, and recursive relationships, operations that traditional relational databases often struggle to express or optimize.

In this case, it’s recommended to use a purpose-built graph database. Neo4j, Amazon Neptune, and Azure Cosmos DB (with Gremlin API) are well-suited for modeling and querying complex, interconnected data. They can also power recommendation engines, identity resolution tools, and knowledge graphs with real-time performance.
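A multi-hop traversal of the kind described above can be sketched with a breadth-first search over an adjacency list. The function name `within_hops` and the sample graph are hypothetical; a graph database would run the equivalent query through its own query language and indexes.

```python
from collections import deque


def within_hops(graph, start, max_hops):
    """Return {node: hop_count} for every node reachable from start within max_hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[start]  # the starting node is not its own connection
    return distance


# Hypothetical "who knows whom" graph stored as an adjacency list.
friends = {
    "amina": ["bilal"],
    "bilal": ["amina", "chen"],
    "chen": ["bilal", "dara"],
    "dara": ["chen"],
}
```

Calling `within_hops(friends, "amina", 2)` returns the direct friend and the friend-of-a-friend, but not the person three hops away.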

Use Case: Keyword-based text search:

Now let’s imagine you’re working on a search experience, maybe for a news website, an internal tool, or a product catalog. Users expect to find what they need quickly, even if they mistype something or use synonyms.

These workloads involve fuzzy matching, ranking, tokenization, and stemming — all operations that typical databases don’t handle efficiently.

In this case, it’s recommended to use a dedicated search engine. Elasticsearch, Amazon OpenSearch, and Azure Cognitive Search are well-suited for full-text indexing. They can also power advanced filtering and faceted search interfaces.

If You’re Dealing with Unstructured Data:

Use Case: File and Media Storage

Suppose you’re building a platform that lets users upload profile photos, download PDFs, or stream video and audio. These files don’t need to be interpreted by a database — they just need to be stored, versioned, and served efficiently.

Think of a learning platform hosting lecture videos, or an HR system storing CVs and scanned contracts. You’re not querying the contents directly, you just want a reliable way to store and retrieve the files.

In this case, it’s recommended to use object storage. Services like Amazon S3, Azure Blob Storage, and Google Cloud Storage are optimized for durability, availability, and low-cost archival. They also support metadata tagging, version control, and lifecycle policies. Most modern SaaS platforms rely on object storage behind the scenes to manage files at scale.

Use Case: Large-Scale Text Analysis and Embedding

Now imagine you have thousands of documents — emails, customer reviews, support tickets, or legal contracts — and you want to search, classify, or summarize them. These aren’t nicely structured fields; they’re raw text, often messy and long.

Let’s say your product team wants to analyze sentiment from open-ended feedback, or legal wants to extract entities from scanned documents. This requires semantic understanding, keyword extraction, and often vector search.

In this case, it’s recommended to preprocess the data into embeddings and store them in a vector database. Tools like Pinecone, Weaviate, and Qdrant support fast similarity search based on meaning rather than keywords. This is the foundation of AI-enhanced search, recommendation engines, and retrieval-augmented generation (RAG) for LLMs.
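The core operation behind similarity search can be illustrated with cosine similarity over small vectors. This is a brute-force sketch with made-up three-dimensional "embeddings"; the function names and sample data are assumptions, and real vector databases use approximate nearest-neighbor indexes rather than scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vector), doc_id) for doc_id, vector in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical embeddings; a real system would get these from an embedding model.
index = {
    "refund-policy": [1.0, 0.0, 0.0],
    "shipping-times": [0.0, 1.0, 0.0],
    "return-process": [0.9, 0.1, 0.0],
}
```

A query vector close to `[1.0, 0.0, 0.0]` ranks the refund and return documents above the shipping one, even though no keyword matching is involved.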

Use Case: Data Lake for Large-Scale Analytics:

If you’re storing massive volumes of raw, unstructured data — like logs, images, audio files, or telemetry streams — and want the flexibility to analyze it later, you’re in data lake territory.

This is different from basic object storage. While object stores like S3 or Azure Blob are great for storing files, a data lake layers on metadata, cataloging, and schema-on-read features, so you can query and process that data at scale.

Think of use cases like a retailer collecting clickstream data, or a utility company storing IoT sensor feeds. In this case, it’s recommended to use a data lake engine such as Databricks, Delta Lake, or Snowflake, especially if you’re planning downstream analytics, ML workflows, or need compliance features like audit logs and fine-grained access control.

Summary flowchart:

Conclusion:

There’s no one-size-fits-all storage. It all comes down to how your data looks and how you plan to use it.

Match your storage to the shape and velocity of your data, and you’ll avoid both overengineering and costly bottlenecks.

Thank you for reading, and hope this article has been insightful and useful to you.

Caching for Mortals: What You Actually Need to Know

Mon, 28 Apr 2025 22:12:03 GMT

A tasty introduction

Imagine you’re building a hot new recipe app that suddenly goes viral because of your revolutionary new tagine recipe. Your server is now bombarded with requests from thousands of hungry users desperately seeking the perfect tagine. Your database is sweating, your CPU is screaming “I CAN’T HANDLE THIS”, and your cloud bill is climbing! Your application has become so slow that users have enough time to prepare couscous while waiting for the page to load.

Sounds familiar? (Maybe not the food part, but the performance crisis might ring a bell)

This is where caching enters the chat. Caching is like that efficient friend who remembers everyone’s coffee order so the whole group doesn’t have to recite their complicated requests every single time. In the world of computing, it’s a technique that stores frequently accessed data in a temporary location for quicker retrieval, saving your precious resources from doing the same work over and over again.

In this article, we’ll break down caching concepts into practical, actionable insights. We’ll explore when to use different caching techniques and how to implement them effectively. Whether you’re a junior developer trying to optimize your first production app or a seasoned engineer wanting to refresh your knowledge, this guide will give you the tools to make informed decisions about caching. So let’s dive in and demystify caching for mere mortals!

The Why: Benefits of Caching

Imagine if every time someone searched for your popular bastilla recipe, your server had to recalculate the preparation time, re-query the database for ingredients, and recompute the nutritional information. This is very inefficient. It’s like a chef forgetting how to make pizza after every single customer! Caching aims to solve that.

Here’s what it brings to the table:

Fast response time: With cached data, your users get their recipe in milliseconds instead of seconds.

Dramatic reduction in server load: Your database was previously processing 5,000 identical queries per minute. With caching, that number drops to maybe 50.

Significant cost savings: Fewer server resources mean lower cloud bills.

Enhanced user experience: A commonly cited rule of thumb is that many users abandon websites that take more than 3 seconds to load. Caching helps keep your bounce rate low and your user satisfaction high.

Caching Fundamentals: The Building Blocks

Why it’s effective: Memory vs Disk:

Now that we understand why caching is useful, let’s break down how it works.

Here’s your app without caching

Here’s your app with caching enabled

One fundamental point is that caching is faster because memory (RAM) is far faster than disk (HDD or SSD). The trade-off is that RAM is much more expensive per gigabyte, so a cache usually holds only a subset of the data.

Cache Hit vs. Cache Miss:

Let’s continue with how it works.

When your application looks for data in the cache, one of two things happens:

Cache Hit: “Eureka! Found it!” Your app found what it needed in the cache. This is the equivalent of finding your keys exactly where you left them. The data is served immediately, and everyone’s happy.
Cache Miss: “Uh-oh, not here.” The data isn’t in the cache, so your app has to take the scenic route to the database, fetch the data, store it in the cache, and then return it. Populating the cache on the way back means the next request for the same data gets a cache hit.
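The hit and miss paths can be sketched as a small read-through helper. The names `CacheAside`, `load_from_db`, `hits`, and `misses` are assumptions for the example, and the plain dictionary stands in for whatever cache you actually use.

```python
class CacheAside:
    """Sketch of the read path: serve from the cache on a hit, load and store on a miss."""

    def __init__(self, load_from_db):
        self.load_from_db = load_from_db  # hypothetical slow lookup, e.g. a database query
        self.cache = {}                   # stand-in for an in-memory cache
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.hits += 1                # cache hit: no database work needed
            return self.cache[key]
        self.misses += 1                  # cache miss: go to the slow backend
        value = self.load_from_db(key)
        self.cache[key] = value           # populate so the next read is a hit
        return value
```

Calling `get` twice with the same key triggers one backend lookup: the first call is a miss that fills the cache, and the second is a hit.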

Cache Eviction: Making Room for New Stuff

Just like that milk that was perfect yesterday but is questionable today, cached data has a shelf life. Here are some strategies that you can apply when the cache is full.

Least Recently Used (LRU): “Haven’t used that recipe in weeks? Out it goes.” Discards the least recently accessed items first when the cache is full.
Least Frequently Used (LFU): “Nobody’s looking at the kale recipes anymore.” Tracks popularity and dumps the least frequently accessed items.
First In, First Out (FIFO): “Oldest items exit first.” Simple but doesn’t account for item popularity.

Cache Invalidation: The Hard Part

There’s a famous quote in computer science: “There are only two hard things in Computer Science: cache invalidation and naming things.” Here are some popular strategies along with their use cases:

Time-Based Expiration (TTL): “If it’s been here too long, it’s probably bad.” You put a timer on your cache entries. Once the clock runs out, they’re tossed. This works great for stuff like API responses or session tokens, where being a little out of date isn’t the end of the world. It’s super simple to set up: just tell the cache how long to keep things. The downside? You might end up serving stale data if something important changes before the timer runs out.
Write-Through Cache: “If it’s important enough to save, it’s important enough to update.” Every time something gets written to your database, it also gets written to the cache right away. This keeps the two perfectly in sync, making it perfect for things like shopping carts or user profiles, where you want instant consistency. The catch is that it slows down your writes, because now you’re hitting two systems at once. Plus, if your cache ever goes down, you’re in for a bad day.
Write-Around Cache: “Save it quietly. We’ll deal with it later.” When you update something, you skip the cache entirely and just hit the database. The cache only gets involved when someone tries to read the data later. This is great for write-heavy systems like logging apps, where most of the stuff written never gets looked at again. It keeps your cache cleaner, but the first read after a write is slower, because the cache has to scramble to catch up.
Write-Back (Write-Behind) Cache: “We’ll get to it… eventually.” Instead of writing to the database right away, you dump the data into the cache first and let the cache figure out when to push it back to the database. This makes writes lightning-fast, which is ideal for things like collecting sensor data or heavy logging. Just be warned: if your cache crashes before syncing back to the database, your data could vanish into the void.
Manual Invalidation: “You break it, you clean it.” When your app knows that something has changed, it takes responsibility and manually deletes or updates the cache entry. This is the go-to strategy for precision-demanding systems like content management platforms and real-time dashboards. It keeps your cache accurate as long as every write path remembers to invalidate, which means you need tight, careful code.
Event-Based (Pub-Sub) Invalidation: “Spread the word: it’s outdated!” Instead of manually trying to keep caches updated, you set up a system where any change to the data fires off an event. All the caches that care about that piece of data listen for the event and update themselves accordingly. This keeps things snappy and coordinated across huge distributed systems. Of course, now you have to run and monitor an event system, which can get complicated fast.
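Two of the strategies above, TTL expiration and manual invalidation, can be combined in a short sketch. The class name `TTLCache`, its method names, and the injectable `clock` are assumptions for illustration; expired entries are removed lazily, on the next read.

```python
import time


class TTLCache:
    """Sketch of time-based expiration with an explicit invalidate() escape hatch."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds  # how long an entry stays fresh
        self.clock = clock              # injectable so the behavior can be tested
        self.entries = {}               # key -> (value, expires_at)

    def set(self, key, value):
        self.entries[key] = (value, self.clock() + self.ttl_seconds)

    def get(self, key, default=None):
        entry = self.entries.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self.entries[key]       # expired: drop it and report a miss
            return default
        return value

    def invalidate(self, key):
        # Manual invalidation: call this when the underlying data changes.
        self.entries.pop(key, None)
```

The TTL bounds how stale an entry can get, while `invalidate` lets the application remove an entry immediately when it knows the source data has changed.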

Deep dive into the LRU algorithm

Next up, we’ll look at LRU, one of the most popular eviction algorithms, from an architectural standpoint:

LRU (Least Recently Used) is pretty human if you think about it: if you don’t interact with a person for a long time, you tend to forget them. That’s basically how it works: when the cache is full, discard the least recently accessed item.

To implement LRU effectively, we need two key operations to be fast:

Retrieving an item by its key
Tracking and updating the “recently used” order

This creates an interesting challenge: Hash tables are great for key-based lookups (O(1) time) but don’t maintain order. Linked lists are perfect for maintaining and modifying order but terrible for lookups.

The solution? A hybrid approach using both:

A hash map (dictionary) for O(1) lookups
A doubly-linked list for tracking access order. The head of the list represents the most recently used item, while the tail represents the least recently used item.

Process Flow

Accessing Data

When a cache entry is accessed, it gets moved to the most recently used position in the list (the head).

This ensures that the most recently accessed items stay at the front, and the least recently used items are pushed to the back.

Adding Data

When adding a new entry, it’s inserted at the most recently used position (the head).

If the cache has reached its capacity, the least recently used entry (the tail of the list) is evicted to make space.
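Putting the hash map and the doubly-linked list together, a compact sketch could look like this. The class and method names are assumptions, and the two sentinel nodes are an implementation convenience that avoids special cases at the ends of the list.

```python
class _Node:
    """Doubly-linked list node holding one cache entry."""

    def __init__(self, key=None, value=None):
        self.key = key
        self.value = value
        self.prev = None
        self.next = None


class LRUCache:
    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.nodes = {}        # hash map: key -> node, for O(1) lookups
        self.head = _Node()    # sentinel before the most recently used entry
        self.tail = _Node()    # sentinel after the least recently used entry
        self.head.next = self.tail
        self.tail.prev = self.head

    def _unlink(self, node):
        node.prev.next = node.next
        node.next.prev = node.prev

    def _push_front(self, node):
        node.prev = self.head
        node.next = self.head.next
        self.head.next.prev = node
        self.head.next = node

    def get(self, key, default=None):
        node = self.nodes.get(key)
        if node is None:
            return default
        self._unlink(node)      # accessing an entry makes it most recently used
        self._push_front(node)
        return node.value

    def put(self, key, value):
        node = self.nodes.get(key)
        if node is not None:    # existing key: update the value and refresh recency
            node.value = value
            self._unlink(node)
            self._push_front(node)
            return
        if len(self.nodes) >= self.capacity:
            lru = self.tail.prev  # least recently used entry sits at the tail
            self._unlink(lru)
            del self.nodes[lru.key]
        node = _Node(key, value)
        self.nodes[key] = node
        self._push_front(node)
```

Every operation touches a constant number of pointers plus one dictionary lookup, so both `get` and `put` run in O(1) time. In everyday Python code, `collections.OrderedDict` or `functools.lru_cache` cover the same ground without a hand-built list.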

Cache in the Real World: Redis and Memcached

In real production systems, you’re not usually hand-building your cache from scratch. Instead, you lean on powerful, battle-tested tools like Redis or Memcached.

Both Redis and Memcached are in-memory key-value stores used for caching, but they have slightly different philosophies:

Memcached is a lightweight, pure caching layer. Think: simple key-value, no persistence, no rich data structures.
Redis is an in-memory data structure store — it can cache, but it can also persist to disk, replicate data, and even act like a mini-database.

Conclusion

To sum up, caching reduces load and latency by keeping key data in fast memory instead of repeatedly hitting slower backends. Of course, caching is all about trade‑offs — you gain speed and cost savings at the expense of added complexity, memory use, and potential data staleness. We’ve covered cache hits versus misses, eviction policies (LRU, LFU, FIFO), invalidation methods (TTL, write‑through, pub‑sub), and real‑world tools like Redis and Memcached. Start by caching your heaviest queries with a simple cache‑aside pattern, then measure and refine for optimal performance. Thank you for reading, and hope this article has been insightful and useful to you.

I Read AI Engineering by Chip Huyen — Here's What Stuck With Me

Thu, 20 Feb 2025 12:00:00 GMT

I’ve been building AI features into production systems for a while now. Like most engineers in this space, I picked things up as I went — a blog post here, a YouTube tutorial there, a lot of trial and error. It worked, but I never had a clear mental model of the full picture. I knew pieces, not the system.

Then I picked up AI Engineering: Building Applications with Foundation Models by Chip Huyen (O’Reilly, 2025). I wish I had read it a year earlier. Not because it taught me entirely new things — some of it I already knew from experience — but because it organized everything into a framework that finally made sense. It connected the dots between evaluation, prompt engineering, RAG, agents, finetuning, and production architecture in a way no blog post ever did.

Here are the ideas from the book that changed how I think about building AI applications.

AI Engineering Is Not ML Engineering

This distinction seems obvious in hindsight, but the book makes it explicit. Traditional ML engineering is about collecting data, training models, and deploying them. You own the entire pipeline from data to weights. AI engineering is different: you’re building on top of foundation models that someone else trained. Your job shifts from model creation to model adaptation.

In ML engineering, the competitive advantage was in data labeling, feature engineering, and model architecture. In AI engineering, everyone has access to the same models through APIs. The moat is in context engineering, evaluation on your own use case, and user experience. The question becomes: how do I get the best results out of these models for my specific problem?

Huyen breaks the AI stack into three layers: application development (prompts, context, UX), model development (finetuning, dataset engineering), and infrastructure (serving, compute, monitoring). Most of the work happens at the top layer. You start there and only move down when you need to.

Evaluation Is Everything

If there’s one theme that runs through the entire book, it’s this: evaluation is everything. And it’s the part most teams get wrong or skip entirely.

Evaluating traditional software is straightforward — either the function returns the expected output or it doesn’t. Evaluating an LLM’s output is messy. The output is open-ended, subjective, and probabilistic. The model might give a different answer each time.

The book introduces Evaluation-Driven Development (EDD), inspired by TDD. The idea is to define your evaluation criteria before you start building. What does a good response look like? What does a bad one look like? Write rubrics, create scoring guidelines, provide examples. If you don’t do this, you’re basically guessing whether your system is getting better or worse with each change.

In practice there’s a spectrum of evaluation methods. Functional correctness is the gold standard when you can use it — if you’re generating code, run it against test cases. Similarity to references works when you have ground truth, using lexical overlap (BLEU, ROUGE) or semantic similarity via embeddings. LLM-as-a-judge is increasingly popular for subjective evaluation — you use a strong model to score the output of another model. It’s scalable but comes with real limitations: self-bias, position bias, and verbosity bias. Despite those flaws, it’s still useful when combined with other methods.
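To make the functional-correctness end of that spectrum concrete, here is a minimal sketch that runs a generated function against a few test cases. The helper name passes_all_tests, the sample source string, and the test cases are all made up for illustration; they are not from the book. Real model output should only be executed inside a sandbox.

```python
# Hypothetical helper: names and test cases are illustrative, not from the book.
def passes_all_tests(candidate_source, func_name, test_cases):
    """Return True if the generated function gives the expected output on every case."""
    namespace = {}
    try:
        # WARNING: exec runs arbitrary code; sandbox real model output.
        exec(candidate_source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


generated = "def add(a, b):\n    return a + b\n"
cases = [((2, 3), 5), ((-1, 1), 0)]
print(passes_all_tests(generated, "add", cases))
```

The score here is binary per sample; averaging it over many generated samples gives a pass rate you can track from one change to the next.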

The practical takeaway: define your evaluation criteria before you write a single prompt. If you care about something — factuality, tone, format, safety — put an evaluation on it.

RAG: Facts vs Form

RAG (Retrieval-Augmented Generation) gets its own deep treatment in the book, and rightfully so. The core idea is simple: before the model generates a response, retrieve relevant information and include it in the context.
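The retrieve-then-generate idea can be sketched in a few lines. This is a toy version under loud assumptions: word overlap stands in for a real retriever (embeddings or BM25), and the function names, documents, and prompt wording are invented for the example.

```python
# Toy retrieval step: word overlap stands in for a real embedding or BM25 search.
def retrieve(query, documents, top_k=2):
    """Rank documents by how many query words they share and keep the top_k."""
    query_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]


def build_prompt(query, documents):
    """Put the retrieved passages ahead of the question in the model's context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Made-up knowledge base for the example.
docs = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Refunds require the original receipt.",
]
print(build_prompt("How long do refunds take?", docs))
```

Only the two refund passages make it into the prompt; the unrelated one is left out, which is the whole point of retrieving before generating.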

Some people think that as context windows grow longer (200K+ tokens now), RAG will become unnecessary. Huyen argues the opposite, and I agree: data always grows faster than context windows. You’ll never fit everything into context, so you’ll always need intelligent retrieval.

The phrase from the book I use all the time now: “RAG is for facts, finetuning is for form.” If your model needs to know specific, up-to-date information — use RAG. If your model needs to adopt a specific style or behavior pattern — consider finetuning. Most applications need RAG first. Finetuning is expensive, can become outdated when the base model updates, and should only be pursued after you’ve maximized what prompting and RAG can do.

Agents Are Powerful but Fragile

The agents chapter is where the book gets exciting. At its core, an agent is just an LLM that can perceive its environment and act on it through tools. ChatGPT browsing the web, a coding agent running terminal commands, a customer support bot querying a database — these are all agents.

A key principle that maps directly to my experience with coding agents: decouple planning from execution. Let the model generate a plan first, validate that plan, then execute it step by step. Blindly letting a model plan and execute simultaneously is how you get agents that go off the rails. This is the same pattern I follow daily — I always ask for a plan first, review it, then let the agent implement.
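Here is a rough sketch of what decoupling planning from execution can look like in code. Everything in it is hypothetical: the plan format, the tool names, and the helpers are mine, not the book's or any framework's. In a real agent the plan would come from the model instead of being hard-coded.

```python
# Hypothetical sketch: the plan format, tool names, and helpers are made up for illustration.
ALLOWED_TOOLS = {"read_file", "run_tests", "edit_file"}


def validate_plan(plan):
    """Reject a plan before execution if any step uses a tool we did not allow."""
    problems = [step for step in plan if step["tool"] not in ALLOWED_TOOLS]
    return len(problems) == 0, problems


def execute_plan(plan, tools):
    """Run steps one at a time, only after the whole plan has been validated."""
    ok, problems = validate_plan(plan)
    if not ok:
        raise ValueError(f"Plan rejected, disallowed steps: {problems}")
    return [tools[step["tool"]](step["arg"]) for step in plan]


# In a real agent the plan would come from the model; here it is hard-coded.
plan = [
    {"tool": "read_file", "arg": "app.py"},
    {"tool": "run_tests", "arg": "tests/"},
]
tools = {
    "read_file": lambda path: f"read {path}",
    "run_tests": lambda path: f"tested {path}",
    "edit_file": lambda path: f"edited {path}",
}
print(execute_plan(plan, tools))
```

The validation step is where a human review or a stricter automated check would plug in; nothing runs until the whole plan has passed it.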

Here’s the math that makes this concrete: each step in an agent’s plan is a potential point of failure, and errors compound. A five-step plan where each step has 90% accuracy gives you only about 59% overall success. This is why keeping agent plans simple and providing verification at each step matters so much.
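The arithmetic is easy to check yourself. The sketch below assumes each step succeeds or fails independently of the others, which is a simplification; the function name is mine.

```python
def plan_success_rate(step_accuracy, num_steps):
    """Chance that every step succeeds, assuming steps fail independently."""
    return step_accuracy ** num_steps


# Compare plans of different lengths at the same per-step accuracy.
for steps in (1, 3, 5, 10):
    print(steps, round(plan_success_rate(0.9, steps), 3))
```

At 90% per step, five steps land at about 59%, and ten steps drop to roughly 35%.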

The book also covers multi-agent patterns — routers that classify and delegate queries, sequential chains where each agent processes the previous output, supervisor agents that orchestrate sub-agents, and parallel execution for independent subtasks. These are worth knowing, but the core lesson is simpler: more steps = more failure points. Keep it tight.

The Data Flywheel

One framework from the book I keep thinking about is around competitive advantage. When building AI products, the barrier to entry is low. If it’s easy for you to build something with an API, it’s easy for anyone else too.

Huyen identifies three potential moats: technology, data, and distribution. With foundation models commoditizing the technology layer and big companies owning distribution, the most sustainable moat for most teams is data. Specifically, the feedback loop: ship fast, collect user interactions, use that data to improve the product, attract more users, collect more data. This flywheel is what separates products that keep getting better from those that stagnate.

This means your feedback collection design matters enormously. Explicit feedback (thumbs up/down) is sparse and biased. Implicit feedback (conversation continuation, task completion, abandonment) is noisy but abundant. Designing how you extract signal from user interactions is an underrated skill.

What Stuck With Me

After reading the book and continuing to build AI features in production, here are the frameworks that stuck:

Evaluation first. Before I write prompts, I write evaluation criteria. Before I change a model or a pipeline component, I make sure I can measure whether the change is an improvement.
RAG before finetuning. Every time someone suggests finetuning, I ask: have we exhausted what we can do with better retrieval and better prompts? The answer is almost always no.
Start simple, add progressively. Don’t try to build the perfect system from day one. Start with a good prompt and RAG. Evaluate. Then add complexity where the metrics tell you to.
Agents are powerful but fragile. The more steps in your agent’s plan, the more points of failure. Decouple planning from execution. Verify at each step.
Context engineering is the skill. Not prompting. Context engineering. That includes what information you retrieve, how you structure it, what goes at the beginning vs. the middle, and how much you include. This is where the craft is.

If you’re building anything with foundation models, this book is the best single resource I’ve found. The specific tools and models will change, but the principles are durable. Read it, then re-read the evaluation chapters, then go build your eval pipeline.

I Use Coding Agents Daily: Here's What Works

Wed, 22 Jan 2025 12:00:00 GMT

Introduction:

Agentic coding is less about “letting the AI code” and more about how you set it up for success. Treat coding agents like junior engineers: give them clear goals, strong constraints, the right tools, and a way to validate their work. This article summarizes practical lessons and patterns that have worked for me when using modern agentic coding tools in real projects.

Variables that affect the quality of the output:

When you interact with a modern coding agent, you can choose the underlying model, the content of the message you send (the agent’s initial context), and the tools available to the agent. Each of these variables affects the quality of the output.

Put the most effort into planning:

For most agentic tasks, with the exception of trivial, clear-cut bug fixes or documentation changes, I recommend spending most of your time and effort on crafting a clear plan for the agent before implementation. Most agent tools offer a plan mode you can use for this. For complex problems where I’m not sure of the right solution, I like to start with an exploratory or brainstorming prompt. When possible, give the agent a clear path to the solution so that it doesn’t have to guess. Here’s an example of an exploratory prompt structure I like to use:

As a (domain expert)
Given this problem:
(your problem)
propose multiple solutions that respect (your best practices or constraints here)
Rank these solutions while providing detailed reasoning and tradeoffs.
Recommend the best solution
(tag the relevant files or folders here)

Context is king:

Just like humans, coding agents perform best when they have the right information, and their output quality degrades as their context fills up. Current models have a context window of around 200K tokens. For reference, see the study by Chroma.

Context engineering is a very important skill for getting the best out of coding agents. Here are some tips that have worked for me:

One task, one session: after each task, start a new chat.
If the feature is too big, ask the agent to split the plan into phases, execute each phase in a new session, and verify the output of each phase before moving to the next.
When running out of context, ask the agent to create a handover markdown document with the work done and the learnings, then pass it to another agent to continue the work.
When possible, provide the agent with the exact files relevant to the task so that it doesn’t waste time and tokens exploring the codebase.

Give the agent a way to verify its work:

Without a way to verify its work, the agent is basically guessing: it might one-shot your task, or you might have to verify its work manually and iterate with it. If you give it a deterministic way to verify the work, it will guess, verify, and, if wrong, rethink the approach until the task is correct.

In practice, ask the agent to write tests and verify the code against them. For backend work, I’ve found that asking the agent to run the backend server and test the endpoint live is effective.

Frontend tasks are more complex to verify. You can use the Playwright MCP or the Claude Chrome extension, but they can be unreliable. The next best thing is to ask the agent to add debug logs, then copy the logs back to the agent if something goes wrong.

The right model for the right task:

For planning, I always use the current best model, which is Claude Opus 4.5; some people have had success with GPT 5.2. For executing the plan, the next tier of models, such as Claude Sonnet, is often enough as long as the plan is detailed. For simple tasks such as committing or writing pull requests, you can choose the smallest, fastest model, for example Claude Haiku or Gemini Flash.

When to use MCP:

The drawback of MCP servers is their context cost: the tool descriptions are stored in context, and you have to remember to disable the server after use. If the service you’re interacting with provides a CLI tool that accomplishes the same task (for example, the GitHub CLI or the Azure CLI), just ask the model to use the CLI instead.

Slash commands:

Slash commands are shortcut prompts for common tasks, and they’re extremely useful. I mainly use them for committing, pushing, and creating pull requests. Example command to commit and push:

1. First, run git diff to see all changes (both staged and unstaged)
2. Analyze the diff to understand what changed
3. Write a conventional commit message based on the diff:
   - Use format: type(scope): description
   - Types: feat, fix, docs, style, refactor, test, chore
   - Keep the first line under 72 characters
   - Add a blank line and bullet points for details if needed
4. Stage all changes with git add -A
5. Commit with the conventional commit message
6. Push to the remote branch. If the branch has no upstream, set it with
   git push -u origin <branch>

Global rules files

claude.md and Cursor rules files are a must-have for establishing your dos and don’ts, coding style, and so on. They can help constrain the agent, but they’re not hard rules; expect the agent to ignore them sometimes. Here’s a resource to find common rules for your stack.

You must review the agent’s output manually. Another helpful pattern is to have a second agent, which you provide with your quality metrics, review the output of the first. This will help you quickly find any red flags.

Conclusion:

Agentic coding works when you treat it like managing a junior dev: clear tasks, good context, and proper verification. The fundamentals won’t change as tools evolve: planning matters more than prompting, context engineering beats brute force, and review is non-negotiable.

Start small, build your own patterns, and remember: you’re still the engineer. The agent just moves faster than you type.

From Hacky Scripts to Professional Code: A Guide to Crafting High-Quality Python Projects

Fri, 04 Oct 2024 22:12:03 GMT

Introduction:

Imagine it’s late at night and you’re working on a Python script that just has to work. It started off as a simple idea: just automate this one thing, scrape this piece of data, and you’re done! As you intuitively add more features, a few extra lines turn into a hundred, and before you know it the script has grown into an unmanageable mess: a tangle of dependencies, random formatting, and small changes that risk breaking everything else.

Sound familiar?

This scenario plays out for developers across the world, whether they’re just starting out with Python or juggling multiple projects that evolved without proper structure. Hence the need to start your project off right, to create something maintainable, shareable, and scalable that others can easily work on!

In this article, we’re going to explore how a few key tools and approaches can elevate your Python projects to a professional standard: automatic code formatting with Black, code linting to ensure quality, dependency management using Poetry, and the power of Makefiles to simplify everyday tasks.

How to Make the Best Use of This Article 📋

Think of this article as a checklist for improving your Python projects, covering everything from dependency management to automated testing.

The article provides external resources to dive deeper into each tool or topic.

What It Is:

A practical guide for leveling up your Python projects.
A starting point for tools that streamline development.

What It’s Not:

A deep dive into each tool’s advanced features.
A one-size-fits-all solution. You don’t need every tool, so please adapt it to your needs!

Dependency Management Made Easy with Poetry 🛠️

If you’ve ever worked with pip and requirements.txt, you’ve likely run into issues like version conflicts, missing packages, or struggles to replicate environments. Poetry solves these problems by maintaining a single source of truth for your project’s dependencies using the pyproject.toml file, making it easier to:

Install dependencies consistently across machines.
Manage both development and production dependencies.
Keep your project reproducible by pinning exact versions.

Getting Started with Poetry

Install Poetry

curl -sSL https://install.python-poetry.org | python3 -

Initialize Your Project

poetry init

This command walks you through setting up your pyproject.toml, where all your dependencies are stored.

Add Dependencies:

poetry add fastapi

This installs FastAPI and updates your pyproject.toml and poetry.lock. For development dependencies like linters or testing tools, add them to a dev group (Poetry 1.2 and later; older releases used the --dev flag):

poetry add --group dev black

You can install all dependencies of a particular project using:

poetry install

This installs the versions pinned in poetry.lock.

Virtual Environments: The Power of Isolation 🐍

If you’ve ever juggled multiple Python projects, each requiring different libraries or even different versions of Python, you’ve probably run into dependency conflicts or broken global installations!

This is where virtual environments become a developer’s best friend — they allow each project to have its own isolated setup, free from the chaos of conflicting versions.

Pyenv: A Solution for Managing Multiple Python Versions

Pyenv allows you to install and switch between different Python versions effortlessly, right from your terminal.

Example Scenario with Pyenv:

Imagine you’re working on a new project that needs Python 3.10 for its features, but you have another project stuck on Python 3.8. Let’s solve this issue with Pyenv.

Install Pyenv: First, install Pyenv with a simple command:

curl https://pyenv.run | bash

Install Multiple Python Versions: Use Pyenv to install Python 3.8 and Python 3.10.

pyenv install 3.10.0  
pyenv install 3.8.10

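Switching Between Versions: To set Python 3.10 globally, run the command below. To pin the older project to Python 3.8 instead, run pyenv local 3.8.10 inside that project’s directory.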

pyenv global 3.10.0

Formatters, Linters and Beyond 🧼

Ensuring code quality is one of the most critical steps in building a professional-grade Python project. Formatters, linters, and type checkers automate this process, helping you maintain consistency, catch bugs, and enforce best practices. In this section, we’ll explore four essential tools to help with this: Black, Flake8, isort, and Mypy.

Black for Code Formatting 🖤

Black is an opinionated code formatter that takes care of all the stylistic choices in your code. Instead of wasting time debating code styles or manually reformatting code, Black automatically does that for you! With just a single command, your Python code gets a uniform look, making it easier to read and maintain.

For example, here’s a before and after comparison of code formatted by Black:

Before:

def add_numbers(a,b): return a+b

After:

def add_numbers(a, b):  
    return a + b

Black’s style is compliant with the PEP 8 guidelines for Python; refer to the guide here.

You can use Black from the command line after installing it with pip:

black folder_needs_formatting

You can also install it in VS Code and set the editor to apply Black whenever you save a Python file, which is the most convenient method.

Check this guide for instructions.

Find the Black documentation here. Alternatives to Black include YAPF (Yet Another Python Formatter) and autopep8.

Linting with Flake8 🔍

While Black focuses on formatting, Flake8 takes care of code quality by detecting common issues such as unused imports, undefined variables, and style violations. It helps you identify potential bugs early, making your code cleaner and more reliable.

For example, Flake8 might flag the following code:

def calculate_total():
    return total  # undefined variable

Flake8 would report that total is never defined (error F821, undefined name), preventing a runtime error later.

It is also advisable to set it up with VS Code. Check this guide for instructions.

Alternatives to Flake8 include Pylint.

Sorting Imports with isort 📦

In larger projects, keeping your imports organized is crucial for readability and maintainability. This is where isort comes in. isort is a tool that automatically sorts your imports, grouping them into logical sections and ensuring that they are in the correct order.

Before isort:

import os  
import requests  
from django.shortcuts import render  
import sys  
from .models import Product  
import json

After isort:

import json  
import os  
import sys  
  
import requests  
from django.shortcuts import render  
  
from .models import Product

With isort, standard library imports, third-party dependencies, and local application imports are neatly separated, following Python’s best practices.

Type Checking with Mypy 🧠

In addition to formatters and linters, Mypy adds static type checking to your Python code. Mypy helps you catch type-related bugs before they even occur by checking the types of variables, function arguments, and return values against the expected types.

For instance, Mypy would catch the following type mismatch:

def add_numbers(a: int, b: int) -> int:  
    return a + b  
  
add_numbers("1", 2)  # Mypy will flag this!

For seamless development, you can also configure Mypy with VS Code.

Learn more in the Mypy documentation.
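Another class of bug Mypy is good at catching is forgetting that a value might be None. Here is a small sketch; the USERS table and function names are made up for the example.

```python
from typing import Optional

# Hypothetical lookup table used only for this example.
USERS = {1: "alice", 2: "bob"}


def find_username(user_id: int) -> Optional[str]:
    """Return the username, or None when the id is unknown."""
    return USERS.get(user_id)


def shout_username(user_id: int) -> str:
    name = find_username(user_id)
    # Writing `return name.upper()` here would be flagged by Mypy,
    # because `name` may be None. Handling the None case satisfies the checker.
    if name is None:
        return "UNKNOWN"
    return name.upper()


print(shout_username(1))
print(shout_username(99))
```

Without the None check, the code would still run fine for known ids and only crash at runtime on an unknown one, which is exactly the kind of bug a static check surfaces early.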

Introduction to Software Testing with Pytest 🧪

How can you be sure that your code does what it’s supposed to — and keeps working even as you add new features or make changes? This is where software testing becomes essential. Testing not only confirms that your code works right now, but also gives you the confidence that it will keep working and not break as your project evolves.

Writing a Simple Test

Suppose you have a function that adds two numbers:

def add_numbers(a, b):  
    return a + b

Now, let’s write a test for it using Pytest:

def test_add_numbers():  
    assert add_numbers(2, 3) == 5  
    assert add_numbers(-1, 1) == 0

To run the test, just execute pytest in your terminal, and Pytest will find and run all your test cases automatically.

Beyond Basics: Advanced Testing Topics

Once you’re comfortable with basic testing, Pytest offers advanced tools to take your testing to the next level:

Test Coverage: Ensure that all parts of your code are being tested by measuring test coverage. Tools like pytest-cov help you identify untested parts of your project.
Parameterized Tests: Run the same test with multiple inputs to catch edge cases without repeating code.
Fixtures: Simplify complex test setups by using fixtures to manage dependencies, like database connections or file structures.

These tools can make your tests more efficient and thorough, ensuring your code is rock-solid and ready for anything. For more on these advanced features, check out the Pytest documentation.
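As a quick taste of parameterized tests, here is the earlier add_numbers test rewritten with pytest.mark.parametrize. The extra (0, 0, 0) case is my own addition for illustration.

```python
import pytest


def add_numbers(a, b):
    return a + b


# Each tuple becomes its own test case in the Pytest report.
@pytest.mark.parametrize(
    "a, b, expected",
    [
        (2, 3, 5),
        (-1, 1, 0),
        (0, 0, 0),
    ],
)
def test_add_numbers(a, b, expected):
    assert add_numbers(a, b) == expected
```

When one input fails, Pytest reports that specific case instead of stopping at the first failed assert, which makes edge cases much easier to track down.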

The Power of Makefiles: Automating Your Workflow ⚙️

As your Python projects grow, you’ll notice a pattern: running the same commands repeatedly, whether it’s for testing, linting, formatting, or even just launching your application. Manually typing out these commands each time can become tedious.

Makefiles allow you to define a series of commands in a file (Makefile), which can then be executed with a single, memorable command: make.

The Structure of a Makefile

A Makefile consists of rules, which are written in the format:

target: dependencies
	command

Target: This is the name of the task you want to run. It can be anything you choose, like format, test, or build.
Dependencies: These are files or targets that must be up-to-date before the current target runs. While they are more commonly used in software compilation, in Python projects, we don’t usually use them unless specific files must be checked before a command runs.
Command: This is the shell command to execute when the target is called. Commands must be indented with a tab, which is a common source of errors when writing Makefiles.

Makefile through an example

Let’s walk through an example. Suppose your project frequently requires the following tasks:

Formatting your code with Black.
Linting your code with Flake8.
Running tests with Pytest.

Create a file named Makefile in the root directory of your project. It should have no extension.

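# Mark targets as phony so Make runs them even if a file or folder
# with the same name (for example, test) exists.
.PHONY: all format lint test

all: format lint test

format:
	black .

lint:
	flake8 .

test:
	pytest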

Here, the all target runs format, lint, and test in that order. When you type make all, all three tasks are executed.

For a more in-depth guide, check this article.

CI/CD: Automate Testing, Formatting, and Code Quality 🚀

With your code formatted, tested, and linted, how can you ensure that every change is consistently checked before merging into your project? That’s where Continuous Integration (CI) and Continuous Deployment (CD) come in.

Continuous Integration (CI): Every time you or your team pushes new code, CI automatically runs your tests, linting, and formatting checks.

Continuous Deployment (CD): Once your code passes all the CI checks, CD takes over by deploying it automatically to your production or staging environment.

CI/CD ensures that every code change is consistently verified before merging. This prevents bugs and keeps your project clean.

Example: CI Pipeline with GitHub Actions 🛠️

Create a .github/workflows/ci.yml file in your project and add the following configuration:

name: CI Pipeline

on:
  push:
    branches:
      - main
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.x'
      - run: pip install poetry && poetry install
      - run: poetry run black --check .
      - run: poetry run flake8
      - run: poetry run pytest

This pipeline runs Black, Flake8, and Pytest on each push to main and on every pull request, so changes are checked before they are merged.

For more, check out the GitHub Actions docs.

Refactoring and Clean Code Practices: Beyond Automation 🧹

While tools like Black and Flake8 help you automate formatting and linting, automation can only take you so far. Clean, maintainable code isn’t just about fixing syntax issues; it’s about writing code that humans can understand and improve over time.

Refactoring in Action

Let’s say you have a function that works but could be cleaner:

Before:

def process_data(data):  
    result = []  
    for item in data:  
        if item['age'] > 18:  
            result.append(item['name'].upper())  
    return result

After refactoring:

ADULT_AGE = 18  
  
def is_adult(person):  
    return person['age'] > ADULT_AGE  
  
def get_name_uppercase(person):  
    return person['name'].upper()  
  
def process_data(data):  
    return [get_name_uppercase(person) for person in data if is_adult(person)]

The code is now split into small, meaningful functions with clear names.
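Refactoring should change the structure, not the behavior, so it’s worth checking both versions against the same input. Below is a minimal sketch of that check. The _original and _refactored suffixes and the sample records are mine, added so both versions can live side by side, and the refactored version is condensed into a single comprehension for brevity.

```python
ADULT_AGE = 18


def process_data_original(data):
    result = []
    for item in data:
        if item["age"] > 18:
            result.append(item["name"].upper())
    return result


def process_data_refactored(data):
    return [p["name"].upper() for p in data if p["age"] > ADULT_AGE]


# Made-up sample records, including the boundary age of exactly 18.
people = [
    {"name": "Ana", "age": 17},
    {"name": "Ben", "age": 18},
    {"name": "Cleo", "age": 42},
]

# Both versions must agree before the old one is deleted.
assert process_data_original(people) == process_data_refactored(people)
print(process_data_refactored(people))
```

Including the boundary value (age exactly 18) is deliberate: off-by-one changes such as swapping > for >= are the easiest mistakes to introduce during a refactor.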

For more tips on refactoring, check out this refactoring guide.

Here are some key clean code practices:

Keep Functions Small: Break your code into bite-sized, single-purpose functions.
Use Descriptive Names: Good names make code self-explanatory, reducing the need for comments.
Avoid Repetition: Stick to the DRY (Don’t Repeat Yourself) principle and refactor duplicate code into reusable functions.

When you combine good refactoring with clean code principles, your projects become easier to maintain and scale. To dive deeper, explore this guide to writing clean code.

Conclusion

To sum up, by adopting these tools and practices, you can transform your Python projects into clean, maintainable, professional-grade codebases. Whether it’s managing dependencies with Poetry or automating tests with CI/CD, each step saves you time and headaches in the long run!

Thanks for reading, and I hope this guide helps you on your journey to building better Python projects! Feel free to reach out on LinkedIn if you have any questions or want to chat more.

Unraveling the Mysteries of the Mind: A Journey Through 20 Psychological Principles

Mon, 04 Dec 2023 22:40:32 GMT

Introduction:

Imagine you’ve just binge-watched an enthralling new TV show. The characters, the plot twists, the dialogues — they’re all fresh in your mind. Then, as if by some twist of fate, you start noticing references to this show everywhere — in conversations, on social media, even in casual remarks from your colleagues. Is this mere chance, or is there something more to this pattern? Dive with us into the fascinating realm of psychological principles and uncover how they subtly influence our perceptions and daily experiences.

1. The Baader-Meinhof Phenomenon: The Illusion of Frequency

Ever mentioned a quirky, seemingly rare vintage car and then spotted it everywhere? That’s the Baader-Meinhof Phenomenon in action. It’s like our brain, the ultimate pattern-recognition machine, suddenly puts a spotlight on what was always there. It’s a quirky reminder of how our perception can paint a skewed picture of reality. Next time this happens, take a beat to think: where else might my brain be playing this trick on me?

Practical Application: When you notice this phenomenon, pause and consider other areas in life where your perception might be creating a false narrative of frequency or importance.

2. The Dunning-Kruger Effect: The Peak of Mt. Stupid

Remember when you first tried cooking a complex dish and thought, ‘Hey, I’m pretty good at this’? Only to realize later that your masterpiece barely scratched the surface? Welcome to the Dunning-Kruger Effect — a humbling journey from the ‘peak of Mt. Stupid’ to the valleys of ‘I have so much to learn.’ It’s a nudge to keep learning, to never stop evolving.

Practical Application: Recognize when you might be on the “peak of Mt. Stupid” and actively seek feedback and knowledge to climb towards true expertise.

3. The Peter Principle: Rising to the Level of Incompetence

Consider Alex, a top-performing sales associate in a retail company. His exceptional sales record led to a promotion to sales manager. However, managing a team, unlike closing sales deals, wasn’t his forte. Alex’s struggle in his new role is a textbook example of the Peter Principle: excelling in one position doesn’t guarantee competence in a higher role.

Practical Application: Assess your own career path. Are you equipped for your current role, or is there a skill gap you need to address?

4. Anchoring Effect: The First Number Sticks

In negotiations, the first number thrown out often becomes an invisible anchor, influencing all that follows. Think about the last time you haggled for a car or negotiated your salary. The initial figure sets the stage, impacting the entire negotiation dance.

Practical Application: Be mindful of initial figures in negotiations — whether you’re buying a car or discussing a raise. Set your anchors wisely!

5. The Cobra Effect: Good Intentions, Unintended Consequences

When the British government in colonial India offered a bounty for cobras, it led to people breeding cobras instead of reducing their population. This is the Cobra Effect, where solutions can sometimes create more problems.

Practical Application: Think through the potential unintended consequences before implementing a solution. Look for a holistic understanding rather than quick fixes.

6. Amara’s Law: Misjudging Technology’s Impact

Consider the rise of social media platforms like Facebook or Instagram. Initially, many viewed them as simple online spaces for sharing photos and catching up with friends. However, over time, their long-term impact has been profound, reshaping how we communicate, influencing global politics, and even affecting mental health. This illustrates Amara’s Law: in our tech-driven world, we often overestimate the short-term effects of new technologies while vastly underestimating their long-term implications. This principle is particularly important for businesses and individuals trying to navigate the ever-evolving landscape of the digital age.

Practical Application: Balance your expectations when evaluating new technology. Consider long-term implications, not just immediate benefits.

7. The Law of Least Effort: Path of Minimum Resistance

Consider the popularity of ride-sharing apps like Uber or Lyft. These services exemplify the Law of Least Effort by offering a more convenient alternative to traditional taxis or public transport. People often choose these apps for their ease of use and accessibility.

Practical Application: When designing products, services, or even your daily routine, aim for simplicity and ease to encourage usage and adherence.

8. Brooks’s Law: More Is Not Always Better in Project Management

Picture a software development team racing against a tight deadline. In a last-minute bid to speed things up, additional programmers are brought in. Instead of accelerating progress, the project stalls further as the new team members require training and orientation. This scenario is a classic example of Brooks’s Law, which posits that adding manpower to a late project only makes it later.

Practical Application: In managing projects, consider the integration and training time new members require. Sometimes, more is not better.

9. The Law of Triviality (Bike Shedding): The Focus on the Inconsequential

Also known as “Bike Shedding,” this law describes how people spend disproportionate time on trivial issues. It’s a common occurrence in meetings where minor details consume hours while major issues get minimal attention.

Practical Application: Next time you’re in a meeting, play the role of the focus-shifter. Watch how discussions veer towards the inconsequential and gently steer them back to the matters that truly impact the bottom line. Remember, the color of the bike shed might be interesting, but it’s the structural integrity of the building that matters most.

10. The Contrast Principle: Relative Perception

Our perceptions are heavily influenced by comparisons, as illustrated by the Contrast Principle. A moderately priced meal seems affordable next to an expensive one, and a warm day feels hot following a cold spell.

Practical Application: Be aware of how contrast might be affecting your judgments. When making decisions, try to assess options on their own merits, not just in comparison to others.

11. The Endowment Effect: Overvaluing What We Own

Ever wondered why it’s so hard to part with that old guitar gathering dust in the corner, even though you haven’t strummed it in years? Welcome to the Endowment Effect, where everything we own, from musical instruments to quirky collectibles, magically gains an inflated value in our eyes. It’s the reason why garage sales are battles of wills, and why decluttering feels like parting with pieces of our soul.

Practical Application: Next time you hesitate to donate or sell something, ask yourself: “Am I valuing this because of its use, or just because it’s mine?”

12. The Serial Position Effect: Remembering the First and Last

In lists or presentations, the first and last items are typically remembered best. This is known as the Serial Position Effect, encompassing the primacy and recency effects.

Practical Application: When delivering information, place the most important points at the beginning or end. This can be particularly effective in presentations or teaching.

13. The Spotlight Effect: We’re Not as Noticed as We Think

The Spotlight Effect is the tendency to overestimate how much others notice our appearance or behavior. It’s that feeling when you trip in public and think everyone saw.

Practical Application: Remember that everyone is more concerned with themselves than with you. This can be liberating in social situations or public speaking.

14. The Foot-in-the-Door Technique: Small Commitments Lead to Larger Ones

This technique involves getting someone to agree to a small request as a precursor to a larger one. It’s a common principle in sales and persuasion.

Practical Application: Start with small requests to build up to larger ones, whether in fundraising, selling, or persuasion.

15. The Ben Franklin Effect: Seeking Consistency in Behavior

The Ben Franklin Effect suggests that when someone does you a favor, they’re more likely to do you another, as people seek consistency in their behavior.

Practical Application: Don’t hesitate to ask for small favors. It can be a starting point for building stronger relationships.

16. The Pygmalion Effect: The Power of Expectations

Consider a manager who believes strongly in a team member’s abilities. That belief, communicated through expectations and support, often results in the employee reaching new heights in their career. This exemplifies the Pygmalion Effect.

Practical Application: Set high expectations for those around you — employees, students, even family members — and provide them with the support to meet these expectations.

17. The IKEA Effect: Valuing Our Own Labor

Consider a meal you’ve cooked from scratch, laboring over each ingredient. Somehow, it always tastes better than a store-bought dish, right? This isn’t just culinary skills at play; it’s the IKEA Effect. The effort we put into creating something, be it food, furniture, or art, endows it with extra value in our eyes. It’s a blend of pride, effort, and, yes, a little bit of love.

Practical Application: Think of something you’ve built or created recently. How does the effort you put into it change how you feel about the final product?

18. Equity Theory: Balancing Input and Output

Picture yourself at work, putting in extra hours, crafting perfect presentations, only to receive the same recognition as your colleague who seems to do the bare minimum. Frustrating, isn’t it? This is Equity Theory in action. It explains why we feel disheartened when our hard work doesn’t seem to pay off as it should. It’s about the balance, or imbalance, of what we put into our jobs (input) versus what we get out of them (output).

Practical Application: Strive for fairness in your interactions. Recognize the efforts of others and ensure they feel valued.

19. Hick’s Law: The Paradox of Choice

Ever found yourself overwhelmed in the supermarket, staring blankly at the dozens of options? That’s Hick’s Law in real life. The more choices we have, whether it’s cereals, cars, or clothes, the harder it becomes to make a decision. This paradox of choice can lead to decision fatigue, making even the simplest choices feel daunting.

Practical Application: Simplify choices to make decision-making easier, whether in business or personal life.

20. Parkinson’s Law: Work Expands to Fill Time

Tasks often expand to fill the time allotted for them, a phenomenon known as Parkinson’s Law. If you give yourself a week to complete a two-hour task, it will take a week.

Practical Application: Set realistic deadlines to improve efficiency. Use this principle to manage time and avoid procrastination.

Conclusion:

And there you have it — a whirlwind tour through the labyrinth of our minds. These principles aren’t just textbook concepts; they’re alive in every decision we make, every relationship we hold, and every goal we chase. Understanding them is like having a roadmap to the human psyche, helping us navigate life with a bit more wisdom and a lot more awareness. So, what’s the next principle you’ll spot in your daily life?

Data Science Pro-Tips: 5 Python Tricks You Must Know

Thu, 13 Apr 2023 22:12:03 GMT

As a data scientist, Python is the go-to tool. Its versatility, with a large ecosystem of libraries and rich data manipulation capabilities, makes it a preferred language for data analysis and machine learning. But, are you fully leveraging Python’s potential to optimize your data science workflows?

In this article, I will share with you some of the most practical tips and tricks for data science using Python. Whether you are a beginner looking to level up your Python skills or an experienced data scientist seeking to enhance your productivity, these tips will help you unlock new possibilities in your data science projects.

Never loop over a dataframe! Use .apply() instead.

To perform any kind of data transformation, you will eventually need to go over every row, perform some computation, and return the transformed column.

A common mistake is to iterate over the rows with a plain Python for loop. Please avoid doing that, as it can be very slow. A better way is to use the apply function in Pandas, ideally combined with a lambda function if your transformation logic is simple, or a separate function that you define if the logic is complex.

Here’s an overview of the apply function with an example using the Titanic dataset:


	# Defining a custom function to categorize age groups
	def categorize_age(age):
	    if age < 18:
	        return 'Child'
	    elif age < 30:
	        return 'Young Adult'
	    elif age < 50:
	        return 'Adult'
	    else:
	        return 'Senior'

	# Use .apply() to apply the custom function to the "Age" column
	titanic_df['Age_Category'] = titanic_df['Age'].apply(categorize_age)

	# Print the updated dataframe
	titanic_df[['Age', 'Age_Category']].head(5)


Note that you can use apply to combine multiple columns from the dataframe, but you need to add axis=1 as an argument to the apply function. Here’s an example using a lambda function and combining two columns, price_1 and price_2, to create a new column tot_price.

df["tot_price"] = df.apply(lambda row: row["price_1"]+ row["price_2"], axis=1)
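To try the row-wise pattern on something concrete, here is a minimal sketch using a tiny made-up DataFrame (the values are purely illustrative). It also shows that for simple arithmetic like this, adding the columns directly gives the same result; that form is usually faster because pandas works on whole columns at once.

```python
import pandas as pd

# Hypothetical data, only for illustration
df = pd.DataFrame({"price_1": [10.0, 20.0, 30.0], "price_2": [1.5, 2.5, 3.5]})

# Row-wise apply: the lambda receives one row (a Series) at a time
df["tot_price"] = df.apply(lambda row: row["price_1"] + row["price_2"], axis=1)

# Column-wise alternative: the addition runs on entire columns at once
df["tot_price_vec"] = df["price_1"] + df["price_2"]

print(df)
```

Keeping apply for logic that really needs custom Python code, and plain column arithmetic for the rest, is a reasonable rule of thumb.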

Select specific column types with select_dtypes()

A very common situation is when you have a large DataFrame with multiple columns of different data types, and you need to filter or perform operations only on columns of a specific data type. Pandas provides select_dtypes() as a convenient function to do that. Let’s see an example:

	import pandas as pd

	# Load the Titanic dataset
	titanic_df = pd.read_csv('titanic.csv')

	# Use select_dtypes() to select only numerical columns
	numerical_cols = titanic_df.select_dtypes(include='number')

	# Print the selected numerical columns
	numerical_cols.head(5)


In this example, we are selecting only the numerical columns in the Titanic dataset.
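If you don’t have titanic.csv at hand, the same idea can be sketched on a small in-memory DataFrame. The column names and values below are made up for illustration; the sketch also uses the exclude argument, which selects everything except the given types.

```python
import pandas as pd

# Small made-up frame standing in for the Titanic data
df = pd.DataFrame({
    "Name": ["Ann", "Bob"],
    "Age": [29.0, 41.0],
    "Pclass": [1, 3],
    "Survived": [True, False],
})

# Integer and float columns only (booleans are not treated as numbers here)
numeric = df.select_dtypes(include="number")

# Everything that is not numeric
non_numeric = df.select_dtypes(exclude="number")

print(list(numeric.columns))
print(list(non_numeric.columns))
```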

Use Pandas query() instead of a boolean mask to filter your DataFrame:

Using query() can make your code shorter and cleaner. Here’s a comparison between the two syntaxes:

# Filter using boolean masks  
  
titanic_df = titanic_df[(titanic_df["Sex"] == "female") & (titanic_df["Age"] > 18)]  
  
# Filter using query()  
  
titanic_df = titanic_df.query('Sex == "female" and Age > 18')

Instead of having to write “titanic_df” twice in my mask, using query() I only had to mention the columns. It achieves the same result while being cleaner and more readable!
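One more detail that may be handy: inside a query() string you can refer to a Python variable by prefixing it with @. The sketch below uses a small made-up DataFrame and a hypothetical min_age variable, and checks that both filtering styles return the same rows.

```python
import pandas as pd

# Made-up sample with the same column names as the Titanic example
df = pd.DataFrame({
    "Sex": ["female", "male", "female", "female"],
    "Age": [22, 35, 17, 40],
})

min_age = 18  # hypothetical threshold, not from the article

# Boolean mask version
mask_result = df[(df["Sex"] == "female") & (df["Age"] > min_age)]

# query() version: @min_age refers to the Python variable above
query_result = df.query('Sex == "female" and Age > @min_age')

print(query_result)
```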

Use list comprehension to create lists in one line:

List comprehension is a powerful technique in Python that allows you to create lists in a single line of code. It generates a new list by applying an expression to each element of an iterable, such as a list, tuple, or string, optionally keeping only the elements that satisfy a condition. It is shorter and often more readable than a traditional loop.

Here’s the basic syntax:

[expression for item in iterable if condition]

Here’s an example of using list comprehension to create a list of even numbers from a given list:

	numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
	even_numbers = [x for x in numbers if x % 2 == 0]
	print(even_numbers)


[2, 4, 6, 8, 10]

Keep in mind that you can also create dictionary comprehensions, set comprehensions, and generator expressions in Python.
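As a quick illustration of those other forms, here is a short sketch on a made-up list of words. The variable names are arbitrary.

```python
words = ["pandas", "numpy", "pandas", "scipy"]

# Dictionary comprehension: map each word to its length
lengths = {w: len(w) for w in words}

# Set comprehension: keep each distinct word once
unique = {w for w in words}

# Generator expression: values are produced lazily and consumed by sum()
total = sum(len(w) for w in words)

print(lengths, unique, total)
```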

Enhance Your Loops with enumerate() and zip() in Python:

enumerate() is used to loop over an iterable while keeping track of the index or position of each item. It helps you avoid using an extra variable, like i. The basic syntax for using enumerate() in a loop is as follows:

for index, item in enumerate(iterable):  
    # Do something with index and item

Here’s an example:

	names = ["Ali", "Ahmed", "Bob", "Mary"]
	for index, name in enumerate(names):
	    print(f"Index: {index}, Name: {name}")


Index: 0, Name: Ali  
Index: 1, Name: Ahmed  
Index: 2, Name: Bob  
Index: 3, Name: Mary

zip() is used to combine two or more sequences into a single iterable object that can be looped over in parallel. It helps you avoid indexing into each sequence by position, making your code cleaner. The basic syntax for using zip() in a loop is as follows:

for item1, item2 in zip(sequence1, sequence2):  
    # Do something with item1 and item2

Here’s an example:

	names = ["Ali", "Ahmed", "Bob", "Mary"]
	ages = [25, 30, 35, 40]
	for name, age in zip(names, ages):
	    print(f"Name: {name}, Age: {age} years")


Name: Ali, Age: 25 years  
Name: Ahmed, Age: 30 years  
Name: Bob, Age: 35 years  
Name: Mary, Age: 40 years
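The two functions also combine nicely. The sketch below reuses the names from the example, with an ages list that is deliberately one item short, to show that zip() stops at the shortest sequence; the start=1 argument makes enumerate() count from 1 instead of 0.

```python
names = ["Ali", "Ahmed", "Bob", "Mary"]
ages = [25, 30, 35]  # one item shorter on purpose

# enumerate() wraps zip(): each step yields (position, (name, age))
rows = [
    f"{position}: {name} is {age}"
    for position, (name, age) in enumerate(zip(names, ages), start=1)
]

for row in rows:
    print(row)
```

Mary is silently dropped here because there is no matching age, which is worth keeping in mind when the sequences come from different sources.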

Conclusion:

To sum up, by implementing these top 5 Python tips in your data science projects, you can make your code cleaner and more readable.

I hope that these tips will help you level up as a data scientist!

If you managed to get here, congratulations! Thanks for reading, I hope you’ve enjoyed the article. Feel free to reach out to me on LinkedIn for further discussion or personal contact.

How I Passed The AWS Solutions Architect Associate (SAA-C03)

Fri, 02 Dec 2022 22:12:03 GMT

I passed the AWS Solutions Architect Associate (SAA-C03) exam in December 2022. In this article I’ll share with you the resources I used, some tips for the exam and some notes I took during preparation. With some prior experience using AWS (S3, Lambda, EC2, RDS) and some general IT knowledge, it took me around one month of light preparation.

What you’ll learn?

You’ll learn how to design good systems on AWS: given requirements on resiliency, performance, cost, security and availability, how do you combine different AWS services to design the best system possible? This means you’ll need to know the AWS services in depth, as well as best practices for designing systems.

How did I prepare?

Took the Udemy course “Ultimate AWS Certified Solutions Architect Associate SAA-C03” by Stephane Maarek. It is comprehensive and gives you a high-level understanding with an emphasis on the “why”. You need to complement it by doing hands-on practice yourself in the AWS console.
While taking the course, I referred to the official documentation for each service to get the fine details.
Did 6 mock exams by Jon Bonso. They provide very good explanations for each question and are really close to the real exam. You should aim for a score above 80% on these mock exams before taking the real one.

Some Tips:

Most of the exam is about the core services (S3, EC2, SQS, VPC, etc.). It is helpful to study newer services, but only a high-level understanding is required.
To check your knowledge of a certain service, ask yourself: What does this service do? When should I use it? How does it integrate with other services? What about security and high availability?
In some questions you’ll find multiple answers that technically work. Read the question again: it will often mention something like “most cost-effective” or “with the least operational overhead”. Use this to guide your final choice.
Take it slowly. The exam is about 2 hours long, and I finished all the questions with 40 minutes to spare.

Below is my non-comprehensive list of notes that I took during preparation.

EC2:

A Service to rent VMs.
Compute optimized instances start with C, Memory Optimized start with R, Storage optimized start with I or D.
Give your instance permissions with IAM Roles.
On-demand instances are the most expensive, good for temporary workloads.
Reserved instances (1 or 3 years) are good for consistent demand. Convertible reserved instances can be exchanged for a different instance type or family, but they come with a smaller discount.
Spot instances are cheap and good only for workloads that can be interrupted. With dedicated hosts you get direct access to the hardware; with dedicated instances you make sure no other customer is using the same hardware as you. To terminate persistent spot instances, cancel the spot request first, then terminate the instances.
ENIs are network cards. They have public/private IPs, can be attached to or detached from EC2 instances, and are useful for failovers.
Use cluster placement for HPC (high-performance computing) applications (single AZ), use spread placement for critical applications (max 7 instances per AZ), and use partition placement for the best of both worlds.
Elastic IPs let you keep the same public IP when you stop then start an instance. You pay for Elastic IPs you are not using.
Hibernation lets you save the RAM state. You are billed while the instance is preparing to hibernate; you are not billed if the instance is preparing to stop.
Limits: 20 running instances per region, plus a vCPU limit. You need to get validation from AWS to increase the limits.

Storage:

Object storage: S3
File storage: EFS (EFS has two tiers, Standard and Infrequent Access), FSx for Lustre, FSx for Windows.
Volume storage: EBS. If you need more than 16K IOPS, use Provisioned IOPS EBS. To encrypt an unencrypted EBS volume, create a snapshot, copy the snapshot with encryption enabled, and create a new volume from the encrypted copy.
You can use EBS snapshots to move data between AZs.
Use AWS DataSync to migrate on-premises storage to S3, EFS or FSx. Storage Gateway connects your on-premises storage to AWS: File Gateway uses NFS and SMB, Volume Gateway syncs data to S3, and Tape Gateway offers compatibility with tape-based backups and uses S3.
On S3 you can enable versioning and MFA delete to prevent accidental deletions. You can enable encryption by default in the bucket settings, add a header to your PUT request to encrypt a specific file, or add a bucket policy that prevents unencrypted files from being uploaded.
To migrate data when you don’t have good network bandwidth, use a Snow device (Snowcone, which is physically small, Snowball Edge, or Snowmobile). Snowball Edge holds max 80 TB, Snowcone 8 TB. Snowball can’t import directly to Glacier; you have to use a lifecycle policy.

Load Balancers :

Want to route same client to same machine? Enable “sticky sessions” option.
All load balancers support health checks and use target groups. You can set up multiple target groups.
Cross-zone load balancing spreads traffic evenly among instances across multiple AZs, which is useful when your AZs have different numbers of instances. It is enabled by default for ALB; for NLB you need to enable it and pay for the traffic; for CLB you need to enable it, but it is free.
ALB/NLB can use multiple SSL certificates using SNI, CLB can’t.
Connection Draining (CLB) / Deregistration Delay (ALB/NLB): when an instance is deregistering, you want to give clients some time to complete their in-flight requests before removing it from the target group. You can configure this delay based on whether your requests are short or long; it can be between 1 and 3600 seconds.
Classic Load Balancer (CLB): the old one. Supports TCP & HTTP/S, supports health checks, fixed hostname. No reason to use it over the modern ones.
Application Load Balancer (ALB): supports HTTP/S (layer 7) and routing rules based on path, URL parameters, etc. Good for microservices (Docker / ECS). The target won’t see the client’s original IP and port directly; if it needs them, it has to read them from the request headers forwarded by the ALB (X-Forwarded-For and X-Forwarded-Port). Can use a Lambda function as a target.
Network Load Balancer (NLB): supports TCP & UDP (layer 4), very high performance, lower latency than ALB, and you get one static IP per AZ. Forwards the original request from the client. You can use it with EC2 instances & IP addresses.
Gateway Load Balancer: makes all traffic go through 3rd party security systems like firewalls, intrusion detection, etc.

Auto Scaling Groups (ASG):

Works with EC2 instances and integrates with load balancers. You can set a min and max size; desired capacity is the number of instances launched initially. Does health checks by default.
To create one you need a launch template, which defines what kind of EC2 instances you want in your ASG. You create scaling policies based on CloudWatch alarms (a certain metric).
You can’t modify a launch template version once created; you need to create a new version. ASGs are free, you only pay for the underlying resources.

Scaling policies:

Target Tracking Scaling: target a certain metric value, for example keep CPU usage at 50%.
Simple Scaling: for example, if CPU > 90%, add 3 instances.
Scheduled Scaling: for example, every Sunday from 8 AM to 5 PM add 6 instances.
Predictive Scaling: scale based on historical data using forecasting.
Cooldown Period: after a scaling trigger happens, wait (300 secs by default) before another trigger can happen. This gives the scaling metric time to stabilize before scaling is triggered again.
You can define lifecycle hooks to do extra work before launching or terminating. For example, before an extra instance goes into service, you may want it to run a script that downloads software.

ASG Termination Policy:

Find the AZ with the most instances.
Delete the instance with the oldest launch template.
Delete the instance closest to the next billing hour.

Relational Database Service (RDS):

Postgres, MySQL, MariaDB, Oracle, SQL Server, Aurora (AWS only)
RDS Storage can autoscale
Read Replica: you can create up to 5, in the same AZ, cross-AZ or cross-region, with async replication between the main DB and the read replicas. Read replicas can be promoted to their own main DB. Cross-region read replicas have network costs.
Multi-AZ for Disaster Recovery: standby DB in another AZ with automatic failover using the same DNS name; uses synchronous replication.
It is possible to use Multi-AZ with your read replicas.
You can activate Multi-AZ after deployment just by changing the config.
Continuous backup and restore with retention up to 35 days.
Scale with read replicas or a bigger instance.
Supports encryption at rest using KMS and in-flight encryption with SSL. You can enforce SSL; the method depends on your DB engine.
To encrypt an RDS database after deployment, create a snapshot, copy the snapshot with encryption enabled, and restore your DB from the encrypted copy.
You can use IAM Auth for MySQL and Postgres.

Aurora:

Aurora only supports MySQL and Postgres. With Aurora you can get 15 read replicas and automatic storage scaling. High availability by default. Supports cross-region replication.
Read replicas can auto scale; you only deal with the reader endpoint and the writer endpoint. There’s one writer and multiple readers.
You can enable Multi-Master if you need high availability for the writer node; it makes all instances capable of both writes and reads.
Comes in provisioned or serverless modes. You can group your read replicas into custom endpoints, which is good if you have different types of reader instances and want to group them based on workload.

Elasticache:

Managed Redis or Memcached, in memory databases for low latency and high performance.
Use cases: cache common queries to take load off the database, or store user sessions.
Redis vs Memcached: Redis has Multi-AZ, read replicas and backups. Memcached is faster but has no durability features.
To enable Redis Auth (used for security), you need to enable encryption in transit.

Dynamo DB:

Managed NoSQL key/value database. Choose provisioned read/write capacity, or on-demand mode to pay for whatever reads/writes you actually consume. Provisioned mode supports autoscaling for reads and writes. Multi-AZ by default. You can make it global by enabling DynamoDB global tables. Supports backup and restore.
Enable DAX for an automatic read cache whenever you need accelerated reads.
Security and auth are integrated with IAM. Use DynamoDB Streams to detect changes and trigger events based on them; you need to enable streams for global tables to work.

SQS:

Used to decouple applications
Default retention for messages is 4 days, with a max of 14 days. Max 256 KB per message. There are producers and consumers; polling means asking the queue for messages, and a consumer can receive up to 10 messages at a time.
Queue Access Policy can allow another AWS account to access your queue.
Message visibility timeout indicates how long a message stays invisible to other consumers after being polled, i.e. how much time a consumer has to process it. The default is 30 secs and the max is 12 hours. After the timeout, the message returns to the queue if it was not deleted. You can change the visibility timeout of a message in real time using the ChangeMessageVisibility API call.
Delivery Delay allows you to set a delay of up to 15 mins; messages will only be visible after the delay.
Dead Letter Queue: if a message has been returned to the queue many times, maybe there’s an error and you may want to get it out of the way. After the max number of receives, the message is sent automatically to the DLQ; you can then process/debug it and return it to the regular queue.
Long Polling vs Short Polling: the default is short polling. If the queue is empty, SQS sends an empty response immediately and you have to send a new request. With long polling, SQS waits up to 20 secs for new messages to arrive, which is useful to decrease the number of API calls.
To implement request-response systems, use the SQS Temporary Queue Client.
FIFO Queue: messages are ordered. Limited to 300 messages/s without batching, or 3000 messages/s with batching.

SNS:

Pub/Sub pattern: the publisher sends the message to the service, and SNS distributes it to all subscribers. You can filter messages so that not all subscribers get all messages. Up to 12M subscriptions per topic; subscribers can be email, SMS, HTTP endpoints or AWS services (SQS, Lambda, etc.).
Supports in-flight and at-rest encryption. You can use SNS access policies for cross-account sharing or for allowing other services to write to SNS.
Use the SNS fanout pattern to send messages efficiently to multiple SQS queues. You can preserve order and ensure deduplication by using SNS FIFO, which works only with SQS FIFO.

Kinesis:

Analyse streaming data in real time.
Data Streams: capture, process and store data streams.
The performance unit is the shard: 1 MB/s or 1000 messages/s per shard for writing; for reading you get 2 MB/s per shard shared across all consumers, or 2 MB/s per shard for each consumer at a higher cost (enhanced fanout). Each record can be up to 1 MB. You can choose provisioned or on-demand capacity; the latter scales automatically by adding shards. Data retention from 1 up to 365 days, and you can replay data.
Firehose: load data streams into AWS data stores. Can read from Data Streams and optionally use Lambda to transform the data before writing it in batches to a destination: S3, Redshift, Elasticsearch or a custom destination. Can send min of 32 MB per batch.
Data Analytics: analyse data streams with SQL. Serverless, integrates well with Firehose and Data Streams. Used for time series analytics, real-time dashboards, etc.
Video Streams: Capture, process and store video streams.
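The per-shard write limits in the notes above turn into a simple sizing estimate. The sketch below is only back-of-the-envelope arithmetic based on the 1 MB/s and 1000 messages/s figures from these notes; the function name and the example workloads are made up for illustration, and a real deployment would want some headroom on top.

```python
import math

# Per-shard write limits, taken from the notes above
MB_PER_SEC_PER_SHARD = 1.0
RECORDS_PER_SEC_PER_SHARD = 1000.0


def estimate_write_shards(mb_per_sec: float, records_per_sec: float) -> int:
    """Rough number of shards needed to absorb a given write load."""
    by_throughput = math.ceil(mb_per_sec / MB_PER_SEC_PER_SHARD)
    by_records = math.ceil(records_per_sec / RECORDS_PER_SEC_PER_SHARD)
    # Whichever limit is hit first decides the shard count; at least 1 shard
    return max(1, by_throughput, by_records)


# Hypothetical workload: 3.5 MB/s spread over 2500 records/s
print(estimate_write_shards(3.5, 2500))
```

In this made-up workload the throughput limit is the binding one; a stream of many tiny records would instead be bound by the records-per-second limit.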

Elastic Container Service (ECS):

Good for microservices. You can store images in Amazon Elastic Container Registry (ECR). You can use the EC2 launch type or the Fargate launch type (serverless). The Docker containers you run are called “tasks”. ECS tasks can be invoked by Amazon EventBridge.
IAM Roles for ECS: for the EC2 launch type you get EC2 instance profiles; for both EC2 and Fargate you get task roles, which let you fine-tune access per task. You can integrate both launch types with ALB or NLB. The file system you use with ECS is EFS because it is shared. You can’t use FSx for Lustre or Windows, and you can’t use S3 as a file system.
To use it, you create a task definition, then launch a service from that task definition. You can autoscale at the task level using AWS Application Auto Scaling, based on CPU, RAM or ALB request count.
Rolling updates: you can control how tasks start and stop during an update using the minimum healthy percent and maximum percent settings.
Elastic Kubernetes Service (EKS): managed Kubernetes on AWS, an alternative to ECS with an open-source API; supports EC2 or Fargate.

Route 53:

Used to register domains and manage DNS. You can use public or private hosted zones; private ones only work inside your VPCs. TTL (Time to Live): the client caches the DNS result for the duration of the TTL; the default is 300 secs. TTL is not mandatory for Alias records.
A CNAME points a hostname to another hostname, but you can’t use it for the root domain. To use the root domain, you need an A record with Alias enabled. You can’t set an Alias record for an EC2 DNS name.
To enable health checks, you must allow traffic from the Route 53 health checkers to your resources.

Routing Policies:

Simple: specify one or multiple IPs; it will route to a randomly chosen one. No health checks.
Multivalue: like simple, but with health checks.
Weighted: assign different weights to different resources and route based on relative weight. Supports health checks.
Failover: used for disaster recovery. Uses health checks; you can only use 2 records here, one primary and one secondary.
Geolocation: you can use it to change behavior based on the user’s country, for example to block content or change language.
Geoproximity: route based on geographic location. It’s more flexible than geolocation: you can use a bias to expand or shrink the geographic region allocated to a specific resource; with 0 bias users go to the closest resource.
Latency: route to the AWS region with the lowest latency for the user. Supports health checks.

If you managed to get here, congratulations! Thanks for reading, I hope you’ve enjoyed the article. For personal contact or discussion, feel free to reach out to me on LinkedIn.

From Idea to Reality: Building a Price History Tool for Moroccan Ecommerce

Tue, 11 Oct 2022 22:40:32 GMT

Have you ever wanted to track the prices of products on an ecommerce platform but found that no price tracker existed for that specific platform? In this article, I’ll share with you how I built a Price Tracker app for Moroccan ecommerce platforms and hosted it on AWS for free. This simple end-to-end data engineering project includes some UX elements and will teach you about web scraping and how to use some of AWS’s services. You can try the app using this link.

The Context & The Plan:

The main value of a Price Tracker is to provide you with the historical price of a product so that you can make your purchasing decision based on data, among other criteria, and minimize the effect of FOMO/discounts that can be in some cases just a form of marketing.

Price trackers exist for all major international ecommerce websites, such as Amazon, eBay, and Alibaba, but they don’t exist for ecommerce platforms in Morocco. The goal of this project is to create a simple price tracker and host it for free. To achieve the latter, I chose to make use of AWS free tier, which offers quite generous cloud resources, just enough to bootstrap this kind of project if used efficiently. Learn more about AWS Free Tier Here

Technical Architecture:

The following picture summarizes the architecture. I chose to spread the workload across a variety of AWS components to use resources more efficiently:

The architecture is split into two main sections:

Scraping/ETL: This section is responsible for periodically getting the data, transforming it, and loading it into a Postgres database. I made use of Airflow for scheduling and coordinating tasks written in Python and S3 for an extra backup of the data.

Data Delivery / UX : In classic web app fashion, we have a front-end UI written in JavaScript, calling a REST API which is, in this case, powered by AWS Lambda, which interacts with our Postgres database in RDS. Making efficient use of resources like this is what made it possible to host the project for free.

Data Model:

The main table, called “Prices,” holds historical price data. We also maintain details about the products tracked in the “products” table and analytics/recommendations related data, such as “prod_ranking” and “KPI” tables.

Generating the best deals:

To generate the best deals, we calculate the average price of a particular product and compare it to its actual price today, getting the percent difference. For example, in the picture below:

The product is down 27.62% from its average price. To further enhance recommendations, we prioritize popular products with the highest number of reviews by category.
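To make the calculation concrete, here is a minimal pandas sketch of one way it could be done. The tiny price history, the column names and the exact formula (latest price compared to the mean of all recorded prices) are assumptions for illustration, not the project’s actual code or schema.

```python
import pandas as pd

# Hypothetical price history: two products, three daily prices each
prices = pd.DataFrame({
    "product_id": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2022-10-01", "2022-10-02", "2022-10-03"] * 2),
    "price": [100.0, 120.0, 80.0, 50.0, 50.0, 50.0],
})

# Average price per product over the whole history
avg_price = prices.groupby("product_id")["price"].mean()

# Most recent price per product
latest_price = prices.sort_values("date").groupby("product_id")["price"].last()

# Percent difference between today's price and the average (negative = cheaper than usual)
pct_diff = ((latest_price - avg_price) / avg_price * 100).round(2)

# Best deals first: the most negative difference comes on top
deals = pct_diff.sort_values()
print(deals)
```

Here product A averages 100 and currently costs 80, so it shows up as 20% below its average, while product B has not moved.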

Deep Dive Into Scraping:

The first step is to get URLs of the categories, as shown below:

Now each category has multiple pages, and we use the page number as a variable to navigate and grab products on each page.

Below is the full scraping code. It uses Python’s requests library to fetch pages, BeautifulSoup to parse the HTML, and tqdm’s thread_map helper for multithreading, which accelerates the task. To learn more about scraping, I’d recommend my article or similar content.

	# coding=utf-8
	from bs4 import BeautifulSoup
	import requests
	from tqdm import tqdm
	from datetime import datetime
	import pandas as pd
	from tqdm.contrib.concurrent import thread_map

	# One output CSV per day, named after today's date
	outfile='/opt/airflow/dags/jumia_data'+str(datetime.today().strftime('%Y-%m-%d'))+'.csv'


	def process_article(article):
	    # Extract the fields of a single product card
	    dataid=article.find('a').get('data-id')
	    href=article.find('a').get('href')
	    category=article.find('a').get('data-category')
	    name=article.find('a').get('data-name')
	    price=article.find(class_='prc').text
	    stars=article.find(class_='stars _s').text if article.find(class_='stars _s') else None
	    # Check for the review element itself before reading its text
	    reviewcount=article.find(class_='rev').text if article.find(class_='rev') else None
	    brand=article.find('a').get('data-brand')
	    discount=article.find(class_='bdg _dsct _sm').text if article.find(class_='bdg _dsct _sm') else False
	    boutiqueOfficielle=True if article.find(class_='bdg _mall _xs') else False
	    etranger=True if article.find(class_='bdg _glb _xs') else False
	    fastDelivery=True if article.find(class_='shipp') else False
	    image=article.find(class_='img').get('data-src')
	    return {'reviewcount':reviewcount,'img_url':image,'id':dataid,'href':href,'name':name,'category':category,'brand':brand,'price':price,'stars':stars,'discount':discount,'boutiqueOfficielle':boutiqueOfficielle,'etranger':etranger,'fastDelivery':fastDelivery,'timestamp':datetime.today().strftime('%Y-%m-%d %H:%M:%S')}

	def process_page(url):
	    # Scrape every product on one listing page and append the rows to the CSV
	    page = requests.get(url)
	    soup = BeautifulSoup(page.text, 'html.parser')
	    PageData=[process_article(article) for article in soup.find_all(name='article',attrs={'class':'prd _fb col c-prd'})]
	    PagaDataTable=pd.DataFrame(PageData)

	    PagaDataTable.to_csv(outfile, mode='a', index=False,header=False, encoding="utf-8")

	def process_sub_category(url):
	    page = requests.get(url)
	    soup = BeautifulSoup(page.text, 'html.parser')
	    try:
	        # The "last page" link carries the total number of pages
	        numPagesToScrape=soup.find(name='a',attrs={'class':'pg','aria-label':'Dernière page'}).get('href').split("page=")[1].split("#")[0]
	    except (AttributeError, IndexError):
	        # No pagination link found: assume a single page
	        numPagesToScrape='1'
	    urls=[url+'?page='+str(i) for i in range(int(numPagesToScrape)+1)]
	    # Fetch the pages of this sub-category in parallel threads
	    thread_map(process_page,urls,max_workers=32)


	def start_scrape():
	    subCategories=pd.read_csv('/opt/airflow/dags/subCategoriesHrefs.csv')
	    subCategories.href=subCategories.href.apply(lambda s: str(s).split("?shipped_from=country_local")[0])
	    # Write the CSV header once; pages are appended below without headers
	    pd.DataFrame({'reviewcount':[],'img_url':[],'id':[],'href':[],'name':[],'category':[],'brand':[],'price':[],'stars':[],'discount':[],'boutiqueOfficielle':[],'etranger':[],'fastDelivery':[],'timestamp':[]}).to_csv(outfile,index=False)
	    for subCategory in tqdm(subCategories.href):
	        process_sub_category(subCategory)


	if __name__ == "__main__":
	    start_scrape()


Airflow: A Powerful Task Scheduling Platform:

Airflow is a robust platform that enables users to create and run workflows using Directed Acyclic Graphs (DAGs) and tasks with dependencies and data flows taken into account. With Airflow, users can specify the order of execution and run retries as well as describe what to do with each task, such as fetching data, running analysis, triggering other systems, and more.

One of the most significant advantages of using Airflow is its user-friendly graphical interface, which allows you to track the progress of your tasks in real-time, while also providing built-in retry on failure and integration with most popular databases. Moreover, it stores the execution times and logs, making it incredibly useful for debugging.

To learn more about Airflow, check out the official documentation, which is the best place to get started.

Below is the DAG used in the project, along with the main Python code used to generate it:

	import airflow
	from airflow import DAG
	from airflow.operators.python_operator import PythonOperator
	from datetime import timedelta
	import sys, os

	sys.path.insert(1, '/opt/airflow/dags/scripts')
	from scrapeJumia import *
	from updateProducts import *
	from updatePrices import *
	from updateProdRanking import *
	from updateKpi import *
	from uploadS3 import *

	default_args = {
	    'owner': 'airflow',
	    'depends_on_past': False,
	    'start_date': airflow.utils.dates.days_ago(2),
	    'email': ['youremail@gmail.com'],
	    'email_on_failure': True,
	    'email_on_retry': False,
	    'retries': 1,
	    'retry_delay': timedelta(minutes=1)
	}
	dag_python = DAG(
	    dag_id = "jumia_python",
	    default_args=default_args,
	    description='Dag that scrapes from jumia and updates a postgres database in RDS',
	    schedule_interval='10 0 * * *',
	    catchup=False
	)
	scrape_jumia = PythonOperator(task_id='scrape_jumia', python_callable=start_scrape, dag=dag_python)
	update_products = PythonOperator(task_id='update_products', python_callable=start_update_products, dag=dag_python)
	update_prices = PythonOperator(task_id='update_prices', python_callable=start_update_prices, dag=dag_python)
	update_prod_ranking = PythonOperator(task_id='update_prod_ranking', python_callable=start_update_prod_ranking, dag=dag_python)
	update_kpi = PythonOperator(task_id='update_kpi', python_callable=start_update_kpi, dag=dag_python)
	upload_s3 = PythonOperator(task_id='upload_s3', python_callable=start_upload_s3, dag=dag_python)
	scrape_jumia >> upload_s3 >> update_products >> update_prices >> update_prod_ranking >> update_kpi


AWS Lambda: A Serverless Backend Solution:

Lambda functions are incredibly flexible and can be used for a wide range of applications. In this project, they were used as a REST API to offload the workload from the main EC2 server. Getting started with Lambda is easy: choose your preferred language, then start a function from scratch, from a container, or from one of the provided AWS blueprints.

Once you’ve created your function, you’ll need to set it up for your use case. In my experience, this includes setting up “layers,” which allow your function to use external libraries such as pandas and sqlalchemy. You’ll also need to set up the REST API to call the function from the web, enabling CORS (Cross-Origin Resource Sharing) to allow calls from your browser. The AWS documentation does an excellent job of explaining this.

After setting up your Lambda function, you’ll have a function with layers and an API gateway:

To enable your function to communicate with your RDS database, you’ll need to connect it to a VPC in the same subnets as your RDS setup and create a “security group” that allows connection on the Postgres port 5432 and assign it to the function:

Here’s an example of a function that gets product details given a product ID or URL:

	import json
	from sqlalchemy import create_engine, text
	import pandas as pd

	def lambda_handler(event, context):
	    engine = create_engine('postgresql://postgres:password!@host:5432/database')
	    body = json.loads(event['body'])

	    if "prod_id" in body:
	        # Look the product up by its ID
	        sql = text("SELECT * FROM products WHERE id = :value")
	        value = body["prod_id"]
	    else:
	        # Otherwise look it up by URL: strip the domain and any query string
	        href_full = body["href"]
	        domain = "https://www.jumia.ma" if href_full.find("www") > 0 else "https://jumia.ma"
	        value = href_full.split(domain)[1].split("?")[0]
	        sql = text("SELECT * FROM products WHERE href = :value")

	    # Bound parameters keep user input out of the SQL string (no SQL injection)
	    product = pd.read_sql(sql, con=engine, params={"value": value})

	    return {
	        'headers': {
	            'Content-Type': 'application/json',
	            'Access-Control-Allow-Origin': '*',
	            'Access-Control-Allow-Headers': 'Authorization,Content-Type',
	            'Access-Control-Allow-Methods': 'GET,POST,OPTIONS',
	        },
	        'statusCode': 200,
	        'body': json.dumps(product.to_dict(orient='records')[0])
	    }


Conclusion:

It was an exciting and fulfilling experience working on this project, as it has real-world applications for the average person. AWS’s free tier offers a generous package, making it ideal for prototyping compared to the competition. As long as you use it efficiently and do not exceed the limits, you can host almost any project.

Thank you for reading this article. We hope you found it informative and learned something new. If you have any questions or would like to discuss further, feel free to reach out on LinkedIn.

Every Data Scientist Needs To Learn This

Sun, 25 Oct 2020 22:40:32 GMT

Photo by Rock’n Roll Monkey on Unsplash

Ever had an idea for an amazing data science project, only to look for the data online and find it nowhere? Unfortunately, not every dataset you’ll ever need is available online. So, what should you do? Abandon your idea and go back to Kaggle? No! A real data scientist should be able to collect their own data!

What’s Web Scraping and why learn it?

The web is the single biggest resource for data; it’s a literal archive of human knowledge, at least for the last 20 years. Web scraping is the art of extracting that data from the web. For a data scientist it’s a handy tool that opens the door to many cool projects.

Note that some websites prohibit scraping and might ban your IP address if you scrape too frequently or maliciously.

How do we scrape?

There are two approaches when it comes to web scraping.

Request-based scraping: With this approach, we send a request to the website’s server, which returns the HTML of the page. This is the same content you see when you click “View page source” in Google Chrome; you can try it right now by pressing Ctrl+U. We then typically use a library to parse the HTML and extract the data we want. This approach is simple, lightweight and very fast. However, it has one drawback that might put you off: most modern websites use JavaScript to render their content, i.e. you don’t see the content of the page until after the JavaScript executes, which the request method can’t handle.

Browser-based scraping: To execute JavaScript we need a fully fledged browser, and that is what this method is about. We simulate a browser, navigate to the page we want, wait for the JavaScript to execute, and can even interact with the page by clicking buttons or filling in forms. Then we just look at the resulting HTML and extract the data. This approach is very flexible and lets you scrape pretty much any website, but it’s much slower and more resource intensive than just sending a request.

Scrape anything with selenium:

Selenium is a widely used library for web automation, but you can use it for scraping too! Basically, any task a human can do manually in a browser can be simulated with Selenium. You can create a bot that performs a certain action when something happens, or you can make Selenium browse web pages and scrape data for you, which is what we’ll be doing in this article.

To parse the HTML we will be using Beautiful Soup.

For further reading, see the documentation for Selenium and Beautiful Soup.

Demo: Scraping Indeed Jobs

Let’s get some practice. The goal of this demo is to scrape jobs from Indeed for a given search query and save them in a CSV file.

More precisely we are interested in:

Job title
Location of the job
Company that posted the offer
Job description
When the job was posted

Here’s a link to a sample job page, and here’s the project code.

First let’s import the required libraries:

from bs4 import BeautifulSoup  
from webdriver_manager.chrome import ChromeDriverManager  
import pandas as pd  
from selenium import webdriver  
from selenium.webdriver.chrome.options import Options
chrome_options = Options()  
chrome_options.add_argument("--headless")

Beautiful Soup is for interacting with the HTML.
Pandas is for exporting to CSV.
The web driver is the actual browser. We will be using Chrome, configured to run in headless mode, which means it runs in the background and we won’t see a browser window going through the job pages. This is optional: if you want to see the browser, remove the --headless argument!

The first thing to do is to get the actual job pages. Luckily, Indeed has a search function; all you have to do is navigate to:

“https://ma.indeed.com/jobs?q=data+scientist&start=10”

You’ll get the second page of jobs related to data science, so you can specify the search query by changing the q argument and the page number by changing the start argument. Note that I’m using the Moroccan portal of Indeed, but this will work for any country.
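As a small illustration of how the q and start arguments fit together, here is a sketch that builds the search URL with the standard library. The helper name build_search_url and the constant RESULTS_PER_PAGE are made up for this example, and the value of ten results per page is an assumption inferred from the sample URL (start=10 for the second page).

```python
from urllib.parse import urlencode

# Assumption taken from the sample URL: ten results per page,
# so page n starts at offset (n - 1) * 10.
RESULTS_PER_PAGE = 10

def build_search_url(query, page, base="https://ma.indeed.com/jobs"):
    """Return the search URL for a query and a 1-based page number."""
    params = {"q": query, "start": (page - 1) * RESULTS_PER_PAGE}
    # urlencode escapes spaces and special characters in the query
    return base + "?" + urlencode(params)

print(build_search_url("data scientist", 2))
```

Letting urlencode do the escaping avoids broken URLs when the query contains spaces or characters such as "&".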

We will implement two functions: a helper that navigates to a URL, extracts the HTML and turns it into a Beautiful Soup object we can interact with, and another that extracts the links to the job pages:

def toSoup(url):
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    return soup

def getPageUrls(query,number):
    url="https://ma.indeed.com/emplois?q="+str(query)+"&start="+str((number-1)*10)
    soup=toSoup(url)
    maxPages=soup.find("div",{"id":"searchCountPages"}).text.strip().split(" ")[3]
    return maxPages,[appendIndeedUrl(a["href"]) for a in soup.findAll("a",{"class":"jobtitle turnstileLink"})]

Now that we have the URLs let’s implement some functions to extract what we want out of the job page:

def paragraphArrayToSingleString(paragraphs):
    string=""
    for paragraph in paragraphs:
        string=string+"\n"+paragraph.text.strip()
    return string

def appendIndeedUrl(url):
    return "https://ma.indeed.com"+str(url)

def processPage(url):
    soup=toSoup(url)
    title=soup.find("h1",{"class":"icl-u-xs-mb--xs icl-u-xs-mt--none jobsearch-JobInfoHeader-title"}).text.strip()
    CompanyAndLocation=soup.find("div",{"class":"jobsearch-InlineCompanyRating icl-u-xs-mt--xs jobsearch-DesktopStickyContainer-companyrating"})
    length=len(CompanyAndLocation)
    if length==3:
        company=CompanyAndLocation.findAll("div")[0].text.strip()
        location=CompanyAndLocation.findAll("div")[2].text.strip()
    else:
        company="NAN"
        location=CompanyAndLocation.findAll("div")[0].text.strip()
    date=soup.find("div",{"class":"jobsearch-JobMetadataFooter"}).text.split("-")[1].strip()
    description=paragraphArrayToSingleString(soup.find("div",{"id":"jobDescriptionText"}).findAll())
    return {"title":title,"company":company,"location":location,"date":date,"description":description}

def getMaxPages(query):
    url="https://ma.indeed.com/emplois?q="+str(query)

Here we are using HTML attributes such as “class” or “id” to locate the information we want. You can figure out how to select the data you need by inspecting the page.

Here’s an example for the title property:

We can see that the title is an “h1” that we can select using its class

Finally, let’s implement a function that gets all the jobs and saves them in a CSV file.

Note that we are getting the max pages number so that the crawler stops when we have reached the final page.

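def getJobsForQuery(query):
    data=[]
    number=1
    maxPages=1
    # range() is evaluated only once, so a while loop is needed
    # for the updated maxPages to actually stop the crawl
    while number<=maxPages:
        pages,urls=getPageUrls(query,number)
        # Assumes the value parsed from the page is a plain integer
        maxPages=int(pages)
        for url in urls:
            try:
                page=processPage(url)
                data.append(page)
            except Exception:
                # Skip job pages whose layout does not match our selectors
                pass
        print("finished Page number: "+str(number))
        number+=1
    #Save the data to a csv file
    pd.DataFrame(data).to_csv("jobs_"+query+".csv")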

Now let’s scrape Data Science Jobs:

driver = webdriver.Chrome(ChromeDriverManager().install(),options=chrome_options)  
getJobsForQuery("data scientist")

Here’s the result:

A Sample of scraped jobs

Conclusion

In this article we learned about web scraping, why it’s important for every aspiring data scientist and the different approaches to it, and we applied that to scrape jobs from Indeed.

If you managed to get here, congratulations! Thanks for reading, I hope you’ve enjoyed the article. For personal contact or discussion, feel free to reach out to me on LinkedIn.

Arabic Topic Classification On The Hespress News Dataset

Sun, 18 Oct 2020 23:46:37 GMT

How to classify Arabic Text the right way

Photo by Markus Winkler on Unsplash

This article is the first in a series where I’ll cover analysis of the Hespress Dataset.

According to alexa.com, Hespress is ranked 4th in Morocco. It’s the biggest news site in the country, and the average Moroccan spends around 6 minutes daily on the website.

The Hespress Dataset is a collection of 11K news articles labelled by topic and 300K comments, each with a score given by users; think of the scores as likes on a Facebook post. This dataset can be used for news article classification, which will be our focus in this article, and for sentiment analysis of Moroccan public opinion. You can download the dataset from Kaggle (www.kaggle.com).

This article is aimed at people who have a little knowledge of machine learning, for example the difference between classification and regression, or what cross validation is. However, I’ll give a brief explanation of each step of the project.

Problem Introduction:

Fortunately, our dataset contains both the articles and their labels, so we are dealing with a supervised learning problem. This makes our life much easier: if that weren’t the case, we would have to manually label each article or go with an unsupervised approach.

In brief, our goal is to predict the topic of an article given its text. In total we have 11 topics:

Tamazight (A Moroccan Language)
Sport (Sport)
Societe (Society)
Regions (Regions)
Politique (Politics)
Orbites (World news)
Medias (News from local newspapers)
Marocains Du Monde (Moroccans of the world)
Faits Divers (Miscellaneous)
Economie (Economy)
Art Et Culture (Art and culture)

Exploratory Data Analysis:

We’ll be using seaborn for data visualisation and pandas for data manipulation.

Let’s start by loading the data:

The data is stored in different files, each containing the data for a specific topic, so we’ll have to loop over the topics and concatenate the results.

import pandas as pd
stories=pd.DataFrame()
topics=["tamazight","sport","societe","regions","politique","orbites","medias","marocains-du-monde","faits-divers","economie","art-et-culture"]

for topic in topics:
  stories=pd.concat([stories,pd.read_csv("stories_"+topic+".csv")])

stories.drop(columns=["Unnamed: 0"],axis=1,inplace=True)

Next let’s get a sample from the data:

stories.sample(5)

Sample columns from the stories dataset

We can see that we have 5 columns. For this article we are only interested in the story and the topic features.

Now let’s check how many stories we have in each topic. This is extremely important for classification: if we have an **imbalanced dataset** (i.e. a lot more datapoints in one topic than in the others), our model will be biased and won’t work as well. If we have this problem, one common solution is to apply an undersampling or oversampling method; we won’t go over the details since that’s outside the scope of this article.

import seaborn as sns
storiesByTopic=stories.groupby(by="topic").count()["story"]
sns.barplot(x=storiesByTopic.index,y=storiesByTopic)

Count of stories by topic

We can see that we have almost 1000 stories per topic, our dataset is perfectly balanced.

Source: memegenerator.net

Data Cleaning:

We are dealing with Arabic text data. Our data cleaning process will consist of 2 steps:

Removing stop words: some words such as “و” and “كيف” occur extremely often in all Arabic texts and carry no meaning that our model can use to predict. Removing them will reduce noise and let our model focus only on relevant words. To do so, we will loop over all the articles and remove every word that appears in a stop words list.

The stop words list that I used is available on GitHub.

from nltk.tokenize import word_tokenize

# Load the stop words (one per line) and add a few dataset-specific ones
with open('stopwordsarabic.txt', 'r', encoding='utf-8') as file1:
    stopwords_arabic = file1.read().splitlines()+["المغرب","المغربية","المغربي"]

def removeStopWords(text,stopwords):
    text_tokens = word_tokenize(text)
    return " ".join([word for word in text_tokens if word not in stopwords])

Removing punctuation: for the same reason we’ll be removing punctuation; for this I’ve used a regular expression.

from nltk.tokenize import RegexpTokenizer
def removePunctuation(text):
    tokenizer = RegexpTokenizer(r'\w+')
    return " ".join(tokenizer.tokenize(text))

Drawing a WordCloud:

Let’s have some fun: we’re going to draw a word cloud of all the stories in our dataset using the Python “WordCloud” library.

Before doing so, some extra steps are needed that are specific to Arabic; to learn more about them, visit this link.

import arabic_reshaper
from bidi.algorithm import get_display
import matplotlib.pyplot as plt
%matplotlib inline

def preprocessText(text,stopwords,wordcloud=False):
    noStop=removeStopWords(text,stopwords)
    noPunctuation=removePunctuation(noStop)
    if wordcloud:
        text=arabic_reshaper.reshape(noPunctuation)
        text=get_display(text)
        return text
    return noPunctuation

drawWordcloud(stories.story,stopwords_arabic)

Word Cloud of Hespress News Articles

Since this dataset contains recent news articles we see “كورونا” (coronavirus) as a recurring word. There’s also “الامازيغية” which is a major language in Morocco, “محمد” which is the most popular name in Morocco and is also the name of the King of Morocco and “الحكومة” which means the government.

Feature engineering:

Machine learning models are in essence mathematical equations and can’t understand text, so before running our models we need to transform our text into numbers. There are multiple approaches to do this; let’s discover the 2 most popular ones.

Word Count:

This one is very simple: every column represents a word from the entire stories corpus, every row represents a story, and each cell value is the number of times the word appears in the story!

TF–IDF:

TF-IDF stands for “Term Frequency Inverse Document Frequency”. It uses a slightly more complicated approach that penalizes common words occurring in many documents.
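To make the two representations concrete, here is a minimal pure-Python sketch on three made-up English sentences. It uses the textbook variant tf × ln(N / df); scikit-learn’s TfidfVectorizer applies a smoothed IDF and normalizes each row by default, so its numbers will differ. The function name tf_idf and the sample documents are invented for this illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: score} dict per document, using tf * ln(N / df)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)  # this is the plain "word count" representation
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

# Hypothetical toy corpus
docs = ["the match was great", "the government met", "the film was great"]
for row in tf_idf(docs):
    print(row)
```

A word such as “the”, which appears in every document, gets a score of zero, while a word that appears in a single document, such as “government”, gets the highest score.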

We will be using TF-IDF since in most cases it yields better performance!

from sklearn.feature_extraction.text import TfidfVectorizer

#Clean the stories 
stories["storyClean"]=stories["story"].apply(lambda s: preprocessText(s,stopwords_arabic))

#Vectorize the stories
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(stories["storyClean"])
y=stories.topic

Modelling:

We will try the following models:

Random Forest
Logistic Regression
SGDClassifier
Multinomial Naïve Bayes

We will run the data through each model and use accuracy, the ratio of correct predictions to total datapoints, as our metric. For more reliable results we score each model with 5-fold cross validation, then plot the results.

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
import numpy as np
from sklearn.metrics import classification_report

def testModel(model,X,y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model.fit(X_train,y_train)
    modelName = type(model).__name__
    pred=model.predict(X_test)
    print(modelName)
    print(classification_report(y_test,model.predict(X_test)))
    score=np.mean(cross_val_score(model, X, y, cv=5))

    return model,{"model":modelName,"score":score}

Models accuracy

Our best model is SGDClassifier with an accuracy of 87%.

Model Interpretation:

Now that we got a working model let’s try to understand a bit more what’s happening, for that we will be answering two questions:

What topics does our model struggle with?
What words are most influential in predicting different topics?

For the first question, we can check the classification report of our best model:

Classification Report SGDClassifier

We are predicting “Sport”, “Art”, “Medias” and “Tamazight” with extremely high accuracy. We are struggling the most with “orbites” (world news) and “societe” (society); this might be because these two are broader, more general topics.

To answer the second question, we will use a useful property of logistic regression: the weights can serve as a measure of the importance of each word for each topic. “ELI5”, a Python library, makes it easy to do that:

We can see that most of the words make sense and correspond to the theme of the topic. For example, for “Art” the top words are “Artist”, “Film”, “Culture” and “Book”.

Conclusion:

In this article, we’ve gone through all the steps required to design a text classification system for Arabic, from data exploration to model interpretation. However, we can still improve our accuracy by tuning the hyperparameters.

In the next article, we’ll try to make sense of the comments on each article using sentiment analysis.

If you managed to get here, congratulations! Thanks for reading, I hope you’ve enjoyed the article. For personal contact or discussion, feel free to reach out to me on LinkedIn.

What You Should Know About Ensemble Learning

Sun, 27 Sep 2020 22:40:32 GMT

The wisdom of the crowds for machines

Photo by Markus Spiske on Unsplash

Introduction:

You want to organize a movie night with your friends and you’re looking for the perfect movie. You search on Netflix and stumble upon one that catches your attention. To decide whether the movie is worth watching, you have multiple options.

Option A: Go ask your brother, who has already watched the movie.

Option B: Go to IMDB, check the rating and read multiple (hopefully spoiler-free) reviews.

You’ll obviously go with option B, since the risk of getting a biased opinion is lower if you get multiple points of view as opposed to a single opinion from your brother. This is the idea and motivation behind ensemble methods. It’s the wisdom of crowds! Now let’s dive into a more technical definition of ensemble learning.

Photo by Arian Darvishi on Unsplash

What is ensemble learning:

According to scholarpedia:

Ensemble learning is the process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem.

This means generating multiple models and combining their opinions in smart ways to get the best possible prediction. In practice, a well-built ensemble usually outperforms any single one of its models. For this to work, the individual models that make up the ensemble should be different; there’s no point in taking a collective opinion if all the individual opinions are the same. We can differentiate our models by using different algorithms, changing the hyperparameters, or training them on different parts of our dataset.

How do we ensemble learn (techniques):

Bagging:

Bagging stands for “bootstrap aggregating”, and it’s one of the simplest and most intuitive techniques to understand. In bagging we use the same algorithm but train it on different subsets of the data. To get these subsets we use a technique called bootstrapping:

Basic bootstrapping illustration, Image by Author

As you can see, Apple is repeated 2 times: bootstrapping samples with replacement, so the same item can be drawn more than once. In practice we often choose a smaller size for the bootstrapped datasets. After creating some bootstrapped datasets we train a model on each, then combine them into an ensemble model; this is called aggregation. For classification problems the class with the most votes is the prediction, and for regression problems we average the outputs of our models.
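The two halves of bagging can be sketched in a few lines of standard-library Python. This is only an illustration of the sampling and aggregation steps, without any real models; the fruit list and the function names are made up for the example.

```python
import random
from collections import Counter

def bootstrap_sample(data, size, rng):
    """Draw `size` items from `data` with replacement."""
    return [rng.choice(data) for _ in range(size)]

def majority_vote(predictions):
    """Aggregate class predictions: the most common label wins."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Aggregate regression predictions by averaging them."""
    return sum(predictions) / len(predictions)

rng = random.Random(0)  # fixed seed so the run is repeatable
fruits = ["apple", "banana", "cherry", "orange"]  # hypothetical dataset
samples = [bootstrap_sample(fruits, len(fruits), rng) for _ in range(3)]
print(samples)  # an item can appear several times inside one sample

print(majority_vote(["spam", "ham", "spam"]))  # votes of three classifiers
print(average([2.0, 3.0, 4.0]))                # outputs of three regressors
```

In a real project each bootstrapped sample would be used to train one model, and the aggregation functions would be applied to the models’ predictions.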

Boosting:

While bagging can be done in parallel (just train all your models at the same time), boosting is an iterative process. Like bagging, we use the same algorithm, but we don’t bootstrap the data and train all the models at the same time. Boosting is sequential: we train the models one by one, and the performance of the previous model affects how we select the training dataset for the next one. More precisely, each new model tries to correct the mistakes made by its predecessor.

The basic workings of boosting, Image by Author

Popular algorithms that implement boosting are AdaBoost and Gradient Boosting.
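As an illustration of how “correcting the predecessor’s mistakes” can work, here is a sketch of the sample reweighting step used by the classic AdaBoost algorithm: samples the current model got wrong receive more weight, so the next model focuses on them. The function name and the toy weights are made up for the example, and the sketch assumes the model’s weighted error is strictly between 0 and 1.

```python
import math

def adaboost_reweight(weights, correct):
    """One AdaBoost round: return (alpha, new_weights).

    weights: current sample weights, summing to 1
    correct: True where the current model classified the sample correctly
    Assumes 0 < weighted error < 1.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    # alpha is the model's say in the final vote; it grows as the error shrinks
    alpha = 0.5 * math.log((1 - error) / error)
    # Shrink the weights of correct samples, grow the weights of mistakes
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.25, 0.25, 0.25, 0.25]     # four samples, equal weights at the start
correct = [True, True, True, False]    # the first model misses the last sample
alpha, new_weights = adaboost_reweight(weights, correct)
print(alpha, new_weights)
```

After the update, the single misclassified sample carries as much weight as the three correct ones together, which is what pushes the next model to get it right.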

Stacking:

This one is simple: we use different algorithms and combine their predictions, typically by training a final model on top of them.

Basic workings of Stacking, Image by Author

Why should you ensemble learn?

As both intuition and practice confirm, ensemble methods yield more accurate results and, when used wisely, are more resilient to overfitting; this is why they are widely used in Kaggle competitions. One drawback is that they take a lot more time to train.

Summary

Ensemble learning turns multiple weak models into one strong model: “together we are stronger”. Multiple techniques have been developed to accomplish this, such as bagging, boosting and stacking. An ensemble model is usually more accurate than a single model and can generalise better.

I hope you’ve got the basic idea behind ensemble models. Now it’s time to implement them in your projects!

Thanks for reading! ❤

Follow me for more informative data science content.

ML Basics : Loan Prediction

Thu, 06 Jun 2019 22:40:32 GMT

The complete Data Science pipeline on a simple problem

Photo by Dmitry Demidko on Unsplash

The problem:

Dream Housing Finance company deals in all home loans. They have a presence across all urban, semi-urban and rural areas. A customer first applies for a home loan, after which the company validates the customer’s eligibility for the loan.

The company wants to automate the loan eligibility process (in real time) based on the customer details provided while filling in the online application form. These details are Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate this process, they have posed the problem of identifying the customer segments that are eligible for a loan, so that they can specifically target these customers.

It’s a classification problem: given information about the application, we have to predict whether they’ll be able to pay the loan or not.

We’ll start with exploratory data analysis, then preprocessing, and finally we’ll test different models such as logistic regression and decision trees.

The data consists of the following columns:

**Loan_ID**: Unique Loan ID

**Gender**: Male/Female

**Married**: Applicant married (Y/N)

**Dependents**: Number of dependents

**Education**: Applicant education (Graduate/Under Graduate)

**Self_Employed**: Self-employed (Y/N)

**ApplicantIncome**: Applicant income

**CoapplicantIncome**: Coapplicant income

**LoanAmount**: Loan amount in thousands of dollars

**Loan_Amount_Term**: Term of loan in months

**Credit_History**: Credit history meets guidelines (yes or no)

**Property_Area**: Urban/Semi Urban/Rural

**Loan_Status**: Loan approved (Y/N); this is the target variable

Exploratory data analysis:

We’ll be using seaborn for visualisation and pandas for data manipulation. You can download the dataset from here: https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/

We’ll import the necessary libraries and load the data:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import numpy as np

train=pd.read_csv("train.csv")
test=pd.read_csv("test.csv")

We can look at the first few rows using the head function:

train.head()

Image by Author

We can see that there’s some missing data; we can explore this further using the pandas describe function:

train.describe()

Image by Author

Some variables have missing values that we’ll have to deal with, and there also seem to be some outliers for the applicant income, coapplicant income and loan amount. We also see that about 84% of applicants have a credit history, because the mean of the Credit_History field is 0.84 and it takes only two values (1 for having a credit history, 0 for not).

It would be interesting to study the distribution of the numerical variables mainly the Applicant income and the loan amount. To do this we’ll use seaborn for visualization.

sns.distplot(train.ApplicantIncome,kde=False)

Image by Author

The distribution is skewed and we can notice quite a few outliers.

Since Loan Amount has missing values, we can’t plot it directly. One solution is to drop the rows with missing values and then plot it; we can do this using the dropna function:

sns.distplot(train.LoanAmount.dropna(),kde=False)

Image by Author

People with better education should normally have a higher income, we can check that by plotting the education level against the income.

sns.boxplot(x='Education',y='ApplicantIncome',data=train)

Image by Author

The distributions are quite similar, but we can see that the graduates have more outliers, which means that the people with huge incomes are most likely well educated.

Another interesting variable is credit history. To check how it affects the loan status, we can turn the loan status into a binary value and then calculate its mean for each value of credit history. A value close to 1 indicates a high loan success rate.

#turn loan status into binary (on a copy, so train keeps the original Y/N labels)
modified=train.copy()
modified['Loan_Status']=train['Loan_Status'].apply(lambda x: 0 if x=="N" else 1 )
#calculate the mean
modified.groupby('Credit_History')['Loan_Status'].mean()

OUT : 
Credit_History
0.0    0.078652
1.0    0.795789
Name: Loan_Status, dtype: float64

People with a credit history are far more likely to get their loan approved: 0.80 vs 0.08. This means that credit history will be an influential variable in our model.

Data preprocessing:

The first thing to do is to deal with the missing values. Let’s first check how many there are for each variable.

train.apply(lambda x: sum(x.isnull()),axis=0)
OUT:
Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

For numerical values a good solution is to fill missing values with the mean; for categorical ones we can fill them with the mode (the value with the highest frequency).

#categorical (and discrete) columns: fill with the mode
for col in ['Gender','Married','Dependents','Loan_Amount_Term','Credit_History','Self_Employed']:
    train[col] = train[col].fillna(train[col].mode()[0])
#numerical: fill with the mean
train['LoanAmount'] = train['LoanAmount'].fillna(train['LoanAmount'].mean())

Next we have to handle the outliers. One solution is just to remove them, but we can also log transform the values to dampen their effect, which is the approach we went for here. Some people might have a low income but a strong CoapplicantIncome, so a good idea is to combine both in a TotalIncome column.

train['LoanAmount_log']=np.log(train['LoanAmount'])
train['TotalIncome']= train['ApplicantIncome'] +train['CoapplicantIncome']
train['TotalIncome_log']=np.log(train['TotalIncome'])

Plotting the histogram of the log of the loan amount, we can see that it’s now close to a normal distribution!

Image by Author

Modeling:

We’re gonna use sklearn for our models. Before doing that we need to turn all the categorical variables into numbers. We’ll do that using the LabelEncoder in sklearn.

from sklearn.preprocessing import LabelEncoder
category= ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
encoder= LabelEncoder()
#encode each categorical column as integers
for i in category:
  train[i] = encoder.fit_transform(train[i])
train.dtypes

OUT:
Loan_ID               object
Gender                 int64
Married                int64
Dependents             int64
Education              int64
Self_Employed          int64
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area          int64
Loan_Status            int64
LoanAmount_log       float64
TotalIncome          float64
TotalIncome_log      float64
dtype: object

Now all our variables have become numbers that our models can understand.

To try out different models we’ll create a function that takes in a model, fits it and measures the accuracy, which here means using the model on the train set and measuring the error on that same set. We’ll also use a technique called K-fold cross-validation, which splits the data into K parts (folds), trains the model on K-1 of them and validates it on the remaining one. It repeats this K times, hence the name, and takes the average score. The latter method gives a better idea of how the model performs in real life.

#Import the models
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold   #For K-fold cross validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

def classification_model(model, data, predictors, outcome):
  #Fit the model:
  model.fit(data[predictors],data[outcome])
  
  #Make predictions on training set:
  predictions = model.predict(data[predictors])
  
  #Print accuracy
  accuracy = metrics.accuracy_score(data[outcome],predictions)
  print ("Accuracy : %s" % "{0:.3%}".format(accuracy))

  #Perform k-fold cross-validation with 5 folds
  #(the old sklearn.cross_validation module was removed; KFold now lives in sklearn.model_selection)
  kf = KFold(n_splits=5)
  scores = []
  for train_idx, test_idx in kf.split(data):
    # Filter training data
    train_predictors = (data[predictors].iloc[train_idx,:])
    
    # The target we're using to train the algorithm.
    train_target = data[outcome].iloc[train_idx]
    
    # Training the algorithm using the predictors and target.
    model.fit(train_predictors, train_target)
    
    #Record the accuracy from each cross-validation run
    scores.append(model.score(data[predictors].iloc[test_idx,:], data[outcome].iloc[test_idx]))
 
  print ("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(scores)))
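
To make the K-fold idea more concrete, here is a minimal pure-Python sketch of how the row indices can be cut into K folds. The helper name kfold_indices is hypothetical (it is not part of sklearn), and like sklearn's KFold with its default settings it does not shuffle the rows.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) index lists for each fold, without shuffling."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most 1
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# 10 rows split into 5 folds: every row lands in the test set exactly once
folds = list(kfold_indices(10, 5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each row is used for validation exactly once and for training K-1 times, which is why the averaged score is a fairer estimate than the accuracy on the training set.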

Now we can test different models. We’ll start with logistic regression:

outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History','Education','Married','Self_Employed','Property_Area']
classification_model(model, train,predictor_var,outcome_var)
OUT : 
Accuracy : 80.945%
Cross-Validation Score : 80.946%

Now we’ll try a decision tree, which should give us a more accurate result:

model = DecisionTreeClassifier()
predictor_var = ['Credit_History','Gender','Married','Education']
classification_model(model, train,predictor_var,outcome_var)

OUT:
Accuracy : 80.945%
Cross-Validation Score : 78.179%

We’ve got the same score on accuracy but a worse score in cross-validation. A more complex model doesn’t always mean a better score.

Finally we’ll try random forests

model = RandomForestClassifier(n_estimators=100)
predictor_var = ['Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'Loan_Amount_Term', 'Credit_History', 'Property_Area',
        'LoanAmount_log','TotalIncome_log']
classification_model(model, train,predictor_var,outcome_var)

OUT: 
Accuracy : 100.000%
Cross-Validation Score : 78.015%

The model is giving us a perfect score on accuracy but a low score in cross-validation. This is a good example of overfitting: the model has a hard time generalizing because it fits the train set perfectly.

Solutions to this include : Reducing the number of predictors or Tuning the model parameters.
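
As a rough illustration of the second option (tuning the model parameters), the sketch below compares a fully grown random forest with one whose trees are limited by max_depth. It runs on synthetic data from make_classification, not on the loan dataset, and the values chosen for max_depth and flip_y are arbitrary assumptions, so the numbers will not match the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, noisy classification data (flip_y randomly reassigns some labels)
X, y = make_classification(n_samples=600, n_features=10, n_informative=3,
                           flip_y=0.2, random_state=0)

def train_and_cv_accuracy(model):
    """Return (accuracy on the training set, mean 5-fold cross-validation accuracy)."""
    cv_acc = cross_val_score(model, X, y, cv=5).mean()
    train_acc = model.fit(X, y).score(X, y)
    return train_acc, cv_acc

full = train_and_cv_accuracy(
    RandomForestClassifier(n_estimators=100, random_state=0))
shallow = train_and_cv_accuracy(
    RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0))

print("full trees    : train %.3f, cv %.3f" % full)
print("max_depth = 3 : train %.3f, cv %.3f" % shallow)
```

The thing to look at is the gap between the two numbers on each line: a smaller gap between training and cross-validation accuracy suggests less overfitting.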

Conclusion:

We’ve gone through a good portion of the data science pipeline in this article, namely EDA, preprocessing and modeling, and we’ve used essential classification models such as logistic regression, decision trees and random forests. It would be interesting to learn more about the logic behind these algorithms, and also to tackle the data scraping and deployment phases. We’ll try to do that in the next articles.

ML Basics: predicting house prices

Sun, 12 May 2019 22:40:32 GMT

What’s machine learning:

In simple terms, it’s the process of teaching machines to solve particular problems without being explicitly programmed.

Sounds fascinating, but how does one teach a machine? The answer is math: some smart people have figured out ways to simulate how humans learn, which is by observation. The core of the machine learning process comes down to feeding a machine learning model a bunch of observations with their corresponding labels, which we call “training”, then testing the model on observations that it didn’t see in the training phase, which we call “validation”. A better model has more accurate validation results.

Example: teach a machine how to tell if a picture shows a cat or a dog

Step 1 : get a huge number of pictures of cats and dogs and classify them yourself

Step 2 : feed an ML model the pictures , watch it learn.

Step 3 : get new pictures of cats and dogs and test if your models perform well

The competition:

“House prices” is a Kaggle competition under the knowledge section, meant for beginners to practice their data science skills. The objective is to predict a house’s price given a bunch of information about it, for example its area, whether it has a pool …

It’s pretty complicated to tackle this kind of challenge without the proper background, so in this article we’ll go through the typical machine learning process while simplifying any ambiguous statistical terms, so that only basic math skills are required.

We’ll start with exploratory data analysis, which aims to get a feel for the data by observing it and analyzing it with graphs. This will help us identify important features, spot irregularities …

Then we’ll do a little bit of data cleaning and preprocessing, so we’ll fix any problems with the data and prepare it to be swallowed by our model.

Finally, we’ll feed our clean data to a model of our choice. In this tutorial we’ll be using a simple linear regression model; then we’ll explore the different ways to evaluate our model’s performance and try to improve it.

Exploratory data analysis:

The first step is to download the dataset from the competition’s website:

House Prices: Advanced Regression Techniques (www.kaggle.com): predict sales prices and practice feature engineering, RFs, and gradient boosting.

We’ll get “train.csv”, “test.csv” and “data_description.txt” which explains what each column means.

Then import the required libraries: seaborn and matplotlib for visualisation, pandas and numpy for data wrangling.

import pandas as pd  
import matplotlib.pyplot as plt  
import seaborn as sns  
import numpy as np  
%matplotlib inline

We can use Pandas to read in csv files. The pd.read_csv() method creates a DataFrame from a csv file.

train = pd.read_csv('train.csv')  
test = pd.read_csv('test.csv')

Let’s check the size of the data:

print ("Train size:", train.shape)
print ("Test size:", test.shape)
out :
Train size: (1460, 81)
Test size: (1459, 80)

We can see that the test data is missing one column, the price of the house, which makes sense because that’s what we need to predict in the competition.

Now we’ll look at a few rows of the data using DataFrame.head() method.

train.head()

We can notice that some of the columns such as PoolQC have missing values. We’ll deal with that later.

To make some sense of the column names we can check the data description file. Here’s a brief version of what you’ll find there.

SalePrice — the property’s sale price in dollars. This is the target variable that we’re trying to predict.
MSSubClass — The building class
MSZoning — The general zoning classification
LotFrontage — Linear feet of street connected to property
LotArea — Lot size in square feet
Street — Type of road access

We’re trying to predict the SalePrice column using all the other available columns. To get more information about our target variable we can use the describe command:

train['SalePrice'].describe()
out :
count      1460.000000
mean     180921.195890  
std       79442.502883  
min       34900.000000  
25%      129975.000000  
50%      163000.000000  
75%      214000.000000  
max      755000.000000  
Name: SalePrice, dtype: float64

count gives the number of price observations available, mean is the average sale price, and std is the standard deviation, which is a measure of the dispersion in prices. We also get the min, the max and the percentiles: 25% of the houses sold for less than the 25% value, 50% for less than the 50% value (the median), and so on.

We’ll dive deeper into the SalePrice analysis by plotting a histogram and checking its skew value.

plt.rcParams['figure.figsize'] = [15, 10]
sns.distplot(train['SalePrice']);
print("Skewness: %f" % train['SalePrice'].skew())

a histogram of the sale price

Skewness is the degree of distortion from a normal distribution in a set of data. A distribution with 0 skewness is perfectly symmetrical. A positive skewness means a longer tail on the right (most values are bunched on the left), and a negative skewness means a longer tail on the left.
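
To see the sign of the skewness in action, here is a small pure-Python sketch on made-up numbers. The skewness function below is a hypothetical helper that computes the plain population skewness; pandas' skew() applies an extra small-sample correction, so its values differ slightly, but the sign behaves the same way.

```python
def skewness(values):
    """Population skewness: mean cubed deviation divided by the std cubed."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / variance ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 1, 2, 2, 3, 20]       # one very large value, like a very expensive house
left_tail = [-20, -3, -2, -2, -1, -1]  # mirror image of right_tail

print("symmetric :", skewness(symmetric))
print("right tail:", skewness(right_tail))
print("left tail :", skewness(left_tail))
```

The symmetric list gives 0, the list with a few very large values gives a positive number, and its mirror image gives the same number with a negative sign.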

Skewness is a problem because it can make our linear regression model inaccurate. We’ll be dealing with it in the preprocessing phase.

To get a feel of the data we’ll plot some variables and see their effect on price.

We’ll start with the living area:

sns.scatterplot(x='GrLivArea',y='SalePrice',data=train)

There’s a clear linear relationship, which is good for our model. We can also see some outliers (some houses with really large areas and a low price). Outliers can damage the quality of the model, so we’ll have to delete them.

We’ll now check the SalePrice vs the overall quality:

sns.boxplot(x='OverallQual',y='SalePrice',data=train)

As expected when the quality increases so does the sale price

Finally, to identify the most important variables we’ll check the correlation matrix and rank the variables based on their correlation with the target variable.

#correlation heatmap (numerical columns only)
numeric = train.select_dtypes(include='number')
sns.heatmap(numeric.corr())
#correlations sorting
#top correlated variables
numeric.corr()['SalePrice'].sort_values(ascending=False)

Top correlated variables :

SalePrice        1.000000  
OverallQual      0.790982  
GrLivArea        0.708624  
GarageCars       0.640409  
GarageArea       0.623431  
TotalBsmtSF      0.613581  
1stFlrSF         0.605852  
FullBath         0.560664  
TotRmsAbvGrd     0.533723  
YearBuilt        0.522897  
YearRemodAdd     0.507101  
GarageYrBlt      0.486362  
MasVnrArea       0.477493  
Fireplaces       0.466929  
BsmtFinSF1       0.386420  
LotFrontage      0.351799  
WoodDeckSF       0.324413  
2ndFlrSF         0.319334  
OpenPorchSF      0.315856  
HalfBath         0.284108  
LotArea          0.263843  
BsmtFullBath     0.227122  
BsmtUnfSF        0.214479  
BedroomAbvGr     0.168213  
ScreenPorch      0.111447  
PoolArea         0.092404  
MoSold           0.046432  
3SsnPorch        0.044584

Data preprocessing :

Handling Null Values:

Next, we’ll examine the null or missing values. We’ll check how many there are in each variable, along with another important measure: the percentage of null values in each column.

#missing data count and percentage  
total = train.isnull().sum().sort_values(ascending=False)  
percent = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)  
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])  
missing_data.head(20)

We see that for PoolQC, MiscFeature, Alley and Fence most of the data points are null. Although not the best path, one way to deal with missing data is to fill it with the column’s mean (for numerical columns), which we can do easily using the fillna() method.

train = train.fillna(train.mean(numeric_only=True))

Removing outliers:

When we visualized the living area vs SalePrice in the EDA section we found a few data points that clearly don’t follow the trend. In statistics we call them outliers, and they can make the model less accurate. In our case, to remove those data points we can target the houses whose living area exceeds 4000 square feet.

#remove outliers
train = train[train.GrLivArea < 4000]
sns.scatterplot(x=train.GrLivArea, y=train.SalePrice)

Handling skewness:

We found positive skewness in SalePrice. A common way to deal with that is the log transform, which we can apply with the np.log1p() function. Then we plot again to check if that worked.

from scipy.stats import norm  #normal distribution used for the fitted curve
train.SalePrice = np.log1p(train.SalePrice)
sns.distplot(train['SalePrice'], fit=norm);
fig = plt.figure()

As you can see our plot in blue is now very close to a normal distribution !

There are indeed more variables with skewness that we’d like to remove. A good way to do that is to measure their skewness and apply the log transform to the variables whose skewness exceeds a certain value.

#log transform all the numerical skewed data
#get all numerical features
numeric_feats = train.dtypes[train.dtypes != "object"].index
skewed_feats = train[numeric_feats].apply(lambda x: x.skew()) #compute skewness
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index
print(skewed_feats)
train[skewed_feats] = np.log1p(train[skewed_feats])
out :
Index(['MSSubClass', 'LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtHalfBath', 'KitchenAbvGr',
       'TotRmsAbvGrd', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch',
       '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal'],
      dtype='object')

Turning categorical columns into dummy variables:

Linear regression models can’t handle categorical data, so a common way to solve that problem is to turn categories into new binary columns. For example, a column for sex with “male” and “female” will turn into two binary columns named “male” and “female” which can take 0 or 1 as values. We can do that easily in pandas using the built-in pd.get_dummies() function.

train = pd.get_dummies(train)
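
Here is a tiny sketch of what pd.get_dummies() does, on a made-up three-row DataFrame (the sex and age columns are assumptions for illustration and are not part of the house prices data). Note that pandas prefixes the new columns with the original column name.

```python
import pandas as pd

# Toy data: one categorical column and one numerical column
people = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "age": [30, 25, 41],
})

# Categorical columns are replaced by one binary column per category;
# numerical columns are left untouched
encoded = pd.get_dummies(people)
print(encoded.columns.tolist())
print(encoded)
```

The sex column becomes sex_female and sex_male, while age is kept as is.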

Modeling :

The final step is modeling. We’ll be building a simple linear model.

from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Separate the target from the features (assumption: y is not defined earlier in the article)
y = train['SalePrice']
X = train.drop(columns='SalePrice')
# Create linear regression object
regr = linear_model.LinearRegression()
# First 730 rows for training, the remaining rows for testing
X_train, Y_train = X[:730], y[:730]
X_test, Y_test = X[730:], y[730:]
# Train the model using the training sets
regr.fit(X_train, Y_train)
# Make predictions using the testing set
pred = regr.predict(X_test)
print("Mean squared error: %.9f" % mean_squared_error(Y_test, pred))
out :
Mean squared error: 0.001315085

We’ve found a mean squared error of 0.001, but what does that mean?

The mean squared error tells you how close a regression line is to a set of points. It does this by taking the distances from the points to the regression line and squaring them. The squaring is necessary to remove any negative signs. It also gives more weight to larger differences. It’s called the mean squared error because you’re finding the average of a set of squared errors.
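
The calculation is simple enough to do by hand. Below is a minimal sketch on three made-up values; mean_squared_error_manual is a hypothetical helper name chosen to avoid clashing with sklearn's function. Keep in mind that because SalePrice was log-transformed earlier, the error reported above is measured on that log scale, not in dollars.

```python
def mean_squared_error_manual(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)

actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 10.0]

# Errors are 1, 0 and -3, so the squared errors are 1, 0 and 9
mse = mean_squared_error_manual(actual, predicted)
print(mse)  # (1 + 0 + 9) / 3
```

The single error of 3 contributes 9 out of the total of 10, which shows how squaring gives more weight to large differences.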

Conclusion:

Throughout this article we’ve looked at how to deal with a machine learning problem. We’ve gone through all the steps required to solve one, from exploratory data analysis and data preprocessing to modeling. To improve the accuracy we could have done some feature engineering (creating new features from the features we have) or used more complex models. I hope this introduction was of great help!

Bubble sort for dummies

Sun, 29 Jul 2018 22:40:32 GMT

We’ll have fun exploring one of the simplest sorting algorithms: bubble sort!

Do we really need sorting algorithms?

Humans are indeed an intelligent species. We crave organizing every aspect of our lives. In modern times digital life has become as influential as the real one, and the way to organize this online mess is through sorting algorithms. These pieces of coded logic are literally everywhere on the internet. You want to check the latest post on your favorite blog? Well, just press the button to sort them by new. You want to find the cheapest toothbrush on an e-commerce website? Just sort them by price!

The most important aspect of a sorting algorithm is its speed; no one wants to wait decades to get their emails sorted! Fortunately today’s computers are really fast, but still only the fastest sorting algorithms are used in practice. In this post we will talk about one of the slowest. This algorithm is the best introduction to sorting because of its simplicity, but it’s almost never used in practice.

Bubble Sort intuition:

Tell me and I forget, teach me and I may remember, involve me and I learn.

One of the best ways to learn an algorithm is to find it out yourself. So in this section we’ll try to invent bubble sort! Are you ready to make some bubbles?

You have an initial unordered list of numbers. The objective is to sort them! You can perform 2 simple actions. Comparing 2 elements of the list and swapping them. Can you come up with a simple algorithm to sort the list only using those 2 actions?

Get a sheet of paper and think it through, it’s worth it!

How to Bubble Sort?

Hope you had fun inventing algorithms! If you’re lucky you have already come up with bubble sort!

Bubble sort is comparison based: you basically compare each element with the next one. If the current element is larger than the next element you swap them; if not, you do not swap and go to the next element.

When you reach the end of the array you go back to the first element and repeat the process. Stop when the array is sorted!

Bubble sort on the example array

You could ask yourself: well, how many repetitions should I perform? It turns out that the maximum needed is (length of the array - 1). For our example array we had to do 2 repetitions; if the array had been completely disordered we would have had to do 3!

Bubble sort in code:

Finally here’s an implementation of bubble sort in code.

def bubbleSort(arr):
    #get the length of the array
    n = len(arr)
    # Traverse through all the elements of the array
    for i in range(n):
        for j in range(0, n-1):
            # if the current element is larger than the next one swap
            if arr[j] > arr[j+1] :
                #this is the python shortcut for swapping
                arr[j], arr[j+1] = arr[j+1], arr[j]
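
To check the claim about the number of repetitions, here is a variant sketch that counts the passes in which at least one swap happened. Two things are additions that are not in the implementation above: the early exit when a full pass makes no swap, and the shorter inner loop that skips the already sorted tail. The function name bubble_sort_counting is made up for this example.

```python
def bubble_sort_counting(arr):
    """Sort a copy of arr in ascending order; return (sorted_list, passes_with_swaps)."""
    result = list(arr)
    n = len(result)
    passes = 0
    for i in range(n - 1):
        swapped = False
        # After i passes the last i elements are already in their final place
        for j in range(n - 1 - i):
            if result[j] > result[j + 1]:
                result[j], result[j + 1] = result[j + 1], result[j]
                swapped = True
        if not swapped:
            break  # a full pass without any swap means the list is sorted
        passes += 1
    return result, passes

print(bubble_sort_counting([4, 3, 2, 1]))  # completely disordered: 3 passes
print(bubble_sort_counting([1, 2, 3, 4]))  # already sorted: 0 passes
```

For a completely reversed array of 4 elements the count is 3, which is the (length of the array - 1) maximum mentioned earlier.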

How fast is bubble sort?

Well, as expected, it turns out that bubble sort is really slow compared to the more optimized algorithms. In computer science, to find out how fast an algorithm is we use big O notation. Basically it measures how many steps an algorithm takes in the worst-case scenario. Bubble sort checks all the elements in the array, which has a length of, let’s say, n, and repeats this n-1 times in the worst-case scenario, so the total number of steps needed is n² - n.

For large numbers n² is actually much bigger than n (you can test it out using a calculator), so we can ignore the n and say that bubble sort has a complexity of O(n²).

Two of the most widely used algorithms are quicksort and mergesort, which can sort in O(n*log(n)) (on average, in the case of quicksort). For large arrays these will outperform bubble sort.

To check this you can calculate n² and n*log(n) (using the base-10 logarithm here). Let’s try that:

if we choose n=10
n²=100      and     n*log(n)=10
now for n=1000
n²=1000000   and n*log(n)=3000
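
The same comparison can be scripted for any n. This is a small sketch that, like the figures above, uses the base-10 logarithm; step_counts is a hypothetical helper name.

```python
import math

def step_counts(n):
    """Return (n squared, n times log10 of n) as rough step counts."""
    return n ** 2, round(n * math.log10(n))

for n in (10, 1000, 1_000_000):
    quadratic, linearithmic = step_counts(n)
    print(f"n={n}: n^2={quadratic}, n*log(n)={linearithmic}")
```

The gap widens quickly: at a million elements n² is a trillion while n*log(n) is only six million.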

In this post we learned how bubble sort works . It might be a snail in terms of speed but it’s essential to understand to tackle the more complex algorithms!

I hope this post helped you to sort your bubbles !

Algorithmic corner : Linear regression

Fri, 08 Jun 2018 22:12:03 GMT

The basics:

In this article we’ll try to uncover how linear regression works. The best way to understand it is through an example. Suppose we have the following problem: we are trying to predict a student’s grade given how many times he didn’t attend the class. With enough data points we’ll end up with a graph that looks like this:

Doing a linear regression means finding the line that is closest to all the data points. In mathematics the equation of a line is y=ax+b, where “a” is the slope and “b” is the intercept. So to find this line we have to find the best “a” and “b” coefficients. But how do we do that, and what does the “best line” mean concretely?

Least square regression:

The best line is the one that is closest to all the data points. In other terms, it’s the line that minimizes the sum of the squared distances between each point and the fitted line. We can see that visually below:

The way to calculate this error is to take the difference between an observed point and the predicted point (using the line), square it, and sum this over all data points. Mathematically it looks like: E = Σ (yᵢ − (a·xᵢ + b))²

Using calculus we can easily get the parameters “a” and “b” for the best line.
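
Setting the derivatives of the squared error to zero gives the standard closed-form solution: a is the sum of (x - mean of x)·(y - mean of y) divided by the sum of (x - mean of x)², and b is (mean of y) - a·(mean of x). Here is a minimal sketch of it in pure Python; the absences and grades are made-up numbers, and fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for the line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

absences = [0, 2, 4, 6, 8]     # made-up number of missed classes
grades = [18, 16, 13, 11, 8]   # made-up grades

a, b = fit_line(absences, grades)
print("slope a =", a)
print("intercept b =", b)
```

On this toy data the slope is negative: each missed class costs a little more than one point on the predicted grade.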

A good thing about linear regression is that it generalizes easily to problems of higher dimension. It’s about adding more terms to the equation and calculating more coefficients. A general model looks like this: y = b + a₁x₁ + a₂x₂ + … + aₙxₙ

Model evaluation: R squared:

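How do we determine how well the model fits the data? One way is to calculate the R² factor.

On the data used to fit the line, R-squared is always between 0 and 100% (on unseen data it can even be negative when the model does worse than simply predicting the mean):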

0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.

In simple terms, R-squared measures how much better our model is than a model that always predicts the mean value of the data. Generally, higher values of R-squared are more desirable. We can measure R-squared on the data we used for training, but that doesn’t reflect how well the model will perform in real life, so a good idea is to split the data into training and test sets and calculate R-squared for both. Generally we’ll observe that the model performs better on the training data.

Another way to assess the model’s performance is the root mean squared error, which tells you how concentrated the data is around the regression line. The lower this error, the better the model. We can calculate it with the formula: RMSE = √( (1/n) · Σ (yᵢ − ŷᵢ)² ), where yᵢ is an observed value, ŷᵢ is the model’s prediction for it and n is the number of data points.
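
Both measures are easy to compute by hand. The sketch below does so on made-up numbers, using the usual definition R² = 1 - (sum of squared errors of the model) / (sum of squared errors of the mean-only model); the function names are chosen for this example only.

```python
import math

def r_squared(actual, predicted):
    """1 minus (squared error of the model / squared error of always predicting the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def rmse(actual, predicted):
    """Square root of the average squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [10.0, 12.0, 14.0, 16.0]
good_model = [10.5, 11.5, 14.5, 15.5]   # always off by 0.5
mean_model = [13.0, 13.0, 13.0, 13.0]   # always predicts the mean

print("good model: R2 =", r_squared(actual, good_model), "RMSE =", rmse(actual, good_model))
print("mean model: R2 =", r_squared(actual, mean_model), "RMSE =", rmse(actual, mean_model))
```

The model that always predicts the mean scores an R-squared of 0, which is exactly the baseline R-squared compares against.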

Linear regression in python:

To apply what we learned we’ll be using a machine learning library in Python called sklearn, and the dataset we’re gonna use is about automobiles. The problem is to predict an automobile’s price based on its characteristics. The data looks like this:

The cleaning part is already done, so we’re gonna test the models directly. We’ll start with a simple linear regression model. We’ll split the data into train and test sets, 80% of the data for training and 20% for testing, and we’ll check our R-squared score on the training set.

from sklearn.model_selection import train_test_split
X = auto_data.drop('price', axis=1)  
Y = auto_data['price']  
X_train, x_test, Y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)  
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()  
linear_model.fit(X_train, Y_train)  
#Checking the score
linear_model.score(X_train, Y_train)
# OUT: 0.96792273709243304

We got a really high score on the training set , what about the test set?

y_predict = linear_model.predict(x_test)  
%pylab inline  
pylab.rcParams['figure.figsize'] = (15, 6)
plt.plot(y_predict, label='Predicted')  
plt.plot(y_test.values, label='Actual')  
plt.ylabel('Price')
plt.legend()  
plt.show()

It doesn’t look that good graphically. Let’s check the score:

r_squared = linear_model.score(x_test, y_test)  
r_squared
# OUT: 0.63225834161155436

We’ve got a low score. This is known in ML terms as over-fitting: the model learned the training set so well that it struggles to generalize. So how can we remedy this problem? Well, there’s another form of regression that attempts to solve this issue, and it’s called Lasso regression. On top of minimizing the sum of the squared errors, it adds a penalty term on the coefficients so as to force them to be small. Concretely, the algorithm will minimize this (up to a scaling constant on the first term): sum of squared errors + α · Σ |coefficientⱼ|

Where α is a parameter we choose. Let’s try it out with an α of 0.5 :

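from sklearn.linear_model import Lasso
# Note: the normalize argument was removed in scikit-learn 1.2; on newer versions scale the features yourself before fitting
lasso_model = Lasso(alpha=0.5, normalize=True)  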
lasso_model.fit(X_train, Y_train)  
lasso_model.score(X_train, Y_train)  
# OUT: 0.96510812725275497

We’ve got a slightly lower score on the training set. Let’s try the model on the test set:

y_predict = lasso_model.predict(x_test)  
%pylab inline  
pylab.rcParams['figure.figsize'] = (15, 6)
plt.plot(y_predict, label='Predicted')  
plt.plot(y_test.values, label='Actual')  
plt.ylabel('Price')
plt.legend()  
plt.show()

This time it seems to fit better let’s check the R-squared value:

r_square = lasso_model.score(x_test, y_test)  
r_square  
# OUT: 0.887194953444848

The R-squared score is way better than with the simple linear model. We can further improve the performance by tweaking the α parameter. Finding the best parameters for a model is called hyper-parameter tuning, and there are functions in sklearn that make it easy to find them.
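
One such function is GridSearchCV, which tries every value in a grid and keeps the one with the best cross-validated score. The snippet below is only a sketch: it runs on synthetic data from make_regression instead of the automobile dataset, and the grid of α values is an arbitrary assumption.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression data: 20 features, only 5 of them actually informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Candidate values for the penalty strength (an arbitrary choice for this sketch)
param_grid = {"alpha": [0.01, 0.1, 0.5, 1.0, 5.0]}

# 5-fold cross-validation for each alpha, scored with R-squared
search = GridSearchCV(Lasso(max_iter=10000), param_grid, cv=5, scoring="r2")
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
print("best alpha:", best_alpha)
print("mean cross-validated R-squared:", search.best_score_)
```

Because the score comes from cross-validation rather than from the training set, the chosen α is less likely to be one that only looks good because of over-fitting.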

Conclusion:

In this article we’ve covered how linear regression works, some ways to assess its performance, the over-fitting problem and one solution to overcome it. I hope this was of great use to you. In the next article we’ll tackle another algorithm: logistic regression.

If you liked this article, be sure to click ❤ below to recommend it and if you have any questions, leave a comment and I will do my best to answer.

Train Price Trends

Fri, 08 Jun 2018 22:12:03 GMT
