<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Tariq Massaoudi]]></title><description><![CDATA[Senior Software Engineer specializing in GenAI, RAG architectures, and MLOps. Building intelligent systems on Azure and AWS.]]></description><link>https://www.tariqmassaoudi.com</link><generator>GatsbyJS</generator><lastBuildDate>Fri, 20 Feb 2026 11:57:38 GMT</lastBuildDate><item><title><![CDATA[The Ultimate Guide to Rate Limiting: Algorithms, Use Cases, and Cloud Solutions]]></title><description><![CDATA[by ChatGPT Introduction When building an API or any system that handles large volumes of requests, one crucial challenge you’ll face is how…]]></description><link>https://www.tariqmassaoudi.com/rate-limiting/</link><guid isPermaLink="false">https://www.tariqmassaoudi.com/rate-limiting/</guid><pubDate>Mon, 26 May 2025 22:12:03 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:875/0*88St4J0kT3Y2QmNV&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;by ChatGPT&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;When building an API or any system that handles large volumes of requests, one crucial challenge you’ll face is how to manage and control traffic. Enter rate limiting — the process that ensures your system doesn’t get overwhelmed by too many requests at once. Whether it’s to prevent abuse, ensure fairness, or just to keep things running smoothly, understanding the right way to implement rate limiting is essential. This article will walk you through the different types of rate limiters, their real-world applications, and how to design an effective one for your system.&lt;/p&gt;
&lt;h2&gt;How Rate Limiting Works and Why Use It&lt;/h2&gt;
&lt;h3&gt;How Rate Limiting Works&lt;/h3&gt;
&lt;p&gt;Rate limiting typically involves tracking the number of requests a user or client makes within a specified time frame (like seconds, minutes, or hours). If the user exceeds the allowed number of requests, the system blocks or delays the excess requests until the next time window begins.&lt;/p&gt;
&lt;p&gt;Here’s a simple flow of how it works:&lt;/p&gt;
&lt;ol class=&quot;list-disc&quot;&gt;
&lt;li&gt;&lt;strong&gt;Request is made&lt;/strong&gt;: A user sends a request to the system.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Check request count&lt;/strong&gt;: The system checks how many requests the user has made in the current time window.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Check against limit&lt;/strong&gt;: If the user has made too many requests, the system responds with an error (commonly HTTP 429 — Too Many Requests). If the limit hasn’t been reached, the request is processed as usual.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Window resets&lt;/strong&gt;: Once the time window expires, the request count is reset, and the user can make new requests within the limit.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:875/0*UfMd7g5n1GtnaPKL.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;How rate limiting works&lt;/p&gt;
&lt;p&gt;Depending on the algorithm used, the method for counting and handling requests varies, but the basic principle remains the same.&lt;/p&gt;
&lt;h3&gt;Why Use Rate Limiting?&lt;/h3&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;&lt;strong&gt;Prevent Overload&lt;/strong&gt;:&lt;br&gt;
Too many requests at once can overwhelm your servers, leading to crashes or degraded performance. By controlling the flow of traffic, rate limiting ensures that your system can handle the load without going down.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fairness&lt;/strong&gt;:&lt;br&gt;
Without rate limiting, some users could hog resources, leaving others with a poor experience. By limiting the number of requests, you ensure that all users get a fair share of the system’s capacity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Protect from Abuse&lt;/strong&gt;:&lt;br&gt;
Rate limiting helps prevent malicious users from exploiting your system. For example, a malicious actor could try to flood your API with requests to crash it or scrape sensitive data. Rate limiting ensures they can’t make too many requests in a short time.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Key Rate Limiting Algorithms&lt;/h2&gt;
&lt;p&gt;When choosing a rate limiter, the algorithm you pick depends on your use case. Each approach comes with its own advantages and trade-offs. Let’s take a look at the most common algorithms used in rate limiting, and when you might want to use them.&lt;/p&gt;
&lt;h3&gt;1. Token Bucket&lt;/h3&gt;
&lt;p&gt;The  &lt;strong&gt;Token Bucket&lt;/strong&gt;  algorithm is one of the most flexible and widely used for rate limiting. It’s designed to allow for bursts of traffic while maintaining a steady flow of requests. Here’s how it works:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Parameters&lt;/strong&gt;:&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Bucket capacity: Maximum number of tokens the bucket can hold.&lt;/li&gt;
&lt;li&gt;Token refill rate: Rate at which tokens are added to the bucket (e.g., 1 token per second).&lt;/li&gt;
&lt;li&gt;Tokens per request: Number of tokens each request consumes (usually 1).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;: Tokens are generated at a fixed rate and placed into a bucket, up to its capacity. Each incoming request consumes a token. If there are tokens available, the request proceeds. If the bucket is empty, requests are delayed or blocked. The bucket capacity determines how large a burst the system will accept, while the refill rate caps the average request rate over time.&lt;/p&gt;
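&lt;p&gt;The token bucket can be sketched in a few lines of Python. This is a minimal, single-process illustration rather than a production implementation: the class name &lt;code&gt;TokenBucket&lt;/code&gt;, its parameters, and the injectable &lt;code&gt;now&lt;/code&gt; clock are assumptions made for the example.&lt;/p&gt;

```python
import time


class TokenBucket:
    """Token bucket limiter; the clock is injectable so the sketch is easy to test."""

    def __init__(self, capacity, refill_rate, now=time.monotonic):
        self.capacity = capacity        # maximum number of tokens the bucket can hold
        self.refill_rate = refill_rate  # tokens added per second
        self.now = now                  # clock function (hypothetical hook for tests)
        self.tokens = float(capacity)   # start full, so an initial burst is allowed
        self.last_refill = now()

    def allow(self, cost=1):
        """Consume `cost` tokens and return True if enough tokens are available."""
        current = self.now()
        elapsed = current - self.last_refill
        # Add the tokens earned since the last call, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = current
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

&lt;p&gt;In this sketch tokens are refilled lazily, whenever &lt;code&gt;allow&lt;/code&gt; is called, instead of by a background timer. A real service would typically keep one bucket per user or API key.&lt;/p&gt;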
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:875/1*wskhOm7nOI8jGyjWsPOjKg.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Token bucket visualized&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why use it?&lt;/strong&gt;  The token bucket is perfect for situations where you need to handle bursts of traffic, like when users submit multiple requests within a short period. It allows for burst behavior but limits the overall rate over time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Real-World Use Case&lt;/strong&gt;:&lt;br&gt;
Imagine an online ticketing platform during a flash sale. Users might attempt to book tickets in bulk within a few seconds, creating a surge in requests. The token bucket ensures that the platform can handle the initial burst of requests but throttles back once the tokens are exhausted, preventing overload.&lt;/p&gt;
&lt;h3&gt;2. Leaky Bucket&lt;/h3&gt;
&lt;p&gt;The  &lt;strong&gt;Leaky Bucket&lt;/strong&gt;  algorithm is similar to the token bucket but with a key difference in how traffic is handled. While the token bucket allows bursts and smooths out traffic over time, the leaky bucket enforces a more rigid output rate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Parameters&lt;/strong&gt;:&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Bucket capacity: Maximum number of requests the bucket can hold.&lt;/li&gt;
&lt;li&gt;Leak rate: Fixed rate at which requests are processed (e.g., 10 requests per second).&lt;/li&gt;
&lt;li&gt;Request arrival rate: Rate at which requests arrive at the system.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;: Requests are added to the bucket. If the bucket overflows (i.e., too many requests arrive), the excess requests are dropped. The leak rate controls how quickly requests are processed and ensures a smooth flow over time.&lt;/p&gt;
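&lt;p&gt;A rough Python sketch of the leaky bucket as a bounded queue follows. The names &lt;code&gt;LeakyBucket&lt;/code&gt; and &lt;code&gt;offer&lt;/code&gt;, and the injectable &lt;code&gt;now&lt;/code&gt; clock, are assumptions for illustration. Here "processing" a request is simulated by removing it from the queue.&lt;/p&gt;

```python
import time
from collections import deque


class LeakyBucket:
    """Leaky bucket modeled as a bounded queue drained at a fixed rate."""

    def __init__(self, capacity, leak_rate, now=time.monotonic):
        self.capacity = capacity    # maximum number of queued requests
        self.leak_rate = leak_rate  # requests processed (leaked) per second
        self.now = now              # clock function (hypothetical hook for tests)
        self.queue = deque()
        self.last_leak = now()

    def _leak(self):
        """Remove as many queued requests as the elapsed time allows."""
        elapsed = self.now() - self.last_leak
        leaked = int(elapsed * self.leak_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            # Advance only by the time that was actually accounted for.
            self.last_leak += leaked / self.leak_rate

    def offer(self, request):
        """Queue the request, or return False if the bucket would overflow."""
        self._leak()
        if len(self.queue) >= self.capacity:
            return False  # overflow: the request is dropped
        self.queue.append(request)
        return True
```

&lt;p&gt;However bursty the arrivals are, requests leave the queue at no more than &lt;code&gt;leak_rate&lt;/code&gt; per second, which is what gives the leaky bucket its steady output.&lt;/p&gt;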
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:875/1*ZTd2U_eKbJYMb961ef5-NQ.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Leaky bucket visualized&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why use it?&lt;/strong&gt;  The leaky bucket is great when you want to maintain a steady, consistent rate of requests. It’s less flexible than the token bucket but can be ideal for systems that need to avoid sudden spikes in traffic.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Real-World Use Case&lt;/strong&gt;:&lt;br&gt;
Consider a live-streaming service where users upload video content. You don’t want the server to be overwhelmed with too many concurrent uploads, so you regulate the rate at which uploads are processed. This ensures that while multiple users can upload content, the server doesn’t get bogged down by too many uploads at once.&lt;/p&gt;
&lt;h3&gt;3. Fixed Window Counter&lt;/h3&gt;
&lt;p&gt;The  &lt;strong&gt;Fixed Window Counter&lt;/strong&gt;  algorithm is the simplest form of rate limiting. It tracks the number of requests within a fixed time window, and if the number of requests exceeds the threshold, further requests are blocked until the next window starts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Parameters&lt;/strong&gt;:&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Time window: The time frame in which requests are counted (e.g., 1 minute).&lt;/li&gt;
&lt;li&gt;Max requests per window: The maximum number of requests allowed within the time window.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;: The system tracks the number of requests made within a fixed time window (e.g., 1 minute). If the number of requests exceeds the limit during that window, the system blocks further requests until the next time window begins.&lt;/p&gt;
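&lt;p&gt;As a minimal Python sketch, a fixed window counter only needs one counter per client and window. The class name &lt;code&gt;FixedWindowCounter&lt;/code&gt;, the &lt;code&gt;client_id&lt;/code&gt; argument, and the injectable &lt;code&gt;now&lt;/code&gt; clock are assumptions for the example.&lt;/p&gt;

```python
import time
from collections import defaultdict


class FixedWindowCounter:
    """Counts requests per client in fixed, clock-aligned windows."""

    def __init__(self, limit, window_seconds, now=time.time):
        self.limit = limit                    # max requests per window
        self.window_seconds = window_seconds  # window length in seconds
        self.now = now                        # clock function (hypothetical hook for tests)
        # Maps (client_id, window index) to a request count.
        # Old windows are never evicted in this sketch.
        self.counts = defaultdict(int)

    def allow(self, client_id):
        """Return True if the client is still under the limit for the current window."""
        window = int(self.now() // self.window_seconds)
        key = (client_id, window)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True
```

&lt;p&gt;Because the window index changes abruptly at each boundary, a client that was just rejected is allowed again one tick later. That keeps the algorithm simple, and it is also the source of its edge-of-window behavior.&lt;/p&gt;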
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:875/0*hbuW5ab8Ef-8JbLJ.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Fixed window counter visualized&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why use it?&lt;/strong&gt;  This algorithm is ideal for applications where traffic is consistent and predictable. It’s simple and effective.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;One major downside of using the  &lt;strong&gt;Fixed Window Counter&lt;/strong&gt;  is the  &lt;strong&gt;spike in traffic at the edges of the window&lt;/strong&gt;. For example, with a limit of 100 requests per minute, a user could make 99 requests just before the end of one window and another 99 immediately after it resets. That is 198 requests processed within a very short time, almost double the intended quota, which can cause unexpected load on the system.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Real-World Use Case&lt;/strong&gt;:&lt;br&gt;
Think of a public API for checking stock prices. Each user is allowed 100 requests per minute. If a user exceeds this limit, they can’t make further requests until the next minute. The fixed window is perfect for this case, where users are making regular requests at a steady rate.&lt;/p&gt;
&lt;h3&gt;4. Sliding Window Log&lt;/h3&gt;
&lt;p&gt;The  &lt;strong&gt;Sliding Window Log&lt;/strong&gt;  algorithm provides more precision by tracking individual request timestamps within a sliding window. Because the limit is enforced over every rolling window rather than over fixed clock intervals, it avoids the boundary bursts of the fixed window counter.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Parameters&lt;/strong&gt;:&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Time window: The length of the sliding window (e.g., 1 minute).&lt;/li&gt;
&lt;li&gt;Max requests: The maximum number of requests allowed within the window.&lt;/li&gt;
&lt;li&gt;Request timestamps: Track the exact time each request was made.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;: Requests are timestamped as they come in, and timestamps older than the window are discarded. The system counts the requests that remain in the sliding window (e.g., the last 1 minute). If that count has already reached the allowed limit, the new request is blocked or delayed.&lt;/p&gt;
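&lt;p&gt;A small Python sketch of the sliding window log for a single client is shown here. The names &lt;code&gt;SlidingWindowLog&lt;/code&gt; and &lt;code&gt;allow&lt;/code&gt; and the injectable &lt;code&gt;now&lt;/code&gt; clock are illustrative assumptions. In this version rejected requests are not recorded in the log.&lt;/p&gt;

```python
import time
from collections import deque


class SlidingWindowLog:
    """Keeps one timestamp per accepted request and counts those inside the window."""

    def __init__(self, limit, window_seconds, now=time.monotonic):
        self.limit = limit                    # max requests in any rolling window
        self.window_seconds = window_seconds  # window length in seconds
        self.now = now                        # clock function (hypothetical hook for tests)
        self.log = deque()                    # timestamps of accepted requests, oldest first

    def allow(self):
        """Return True if accepting a request now keeps the window within the limit."""
        current = self.now()
        cutoff = current - self.window_seconds
        # Drop timestamps that have fallen out of the window.
        while self.log and cutoff >= self.log[0]:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(current)
        return True
```

&lt;p&gt;The precision comes at a memory cost: the log holds up to &lt;code&gt;limit&lt;/code&gt; timestamps per client, whereas a fixed window counter holds a single integer.&lt;/p&gt;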
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:875/0*2olR_8mWUYvpK-R_.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Sliding window log visualized&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why use it?&lt;/strong&gt;  This algorithm is ideal when you need precise enforcement of a limit over any rolling period. It closes the loophole at window boundaries, at the cost of storing a timestamp for every request.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Real-World Use Case&lt;/strong&gt;:&lt;br&gt;
A mobile banking app allows users to make 10 transactions per day. With the sliding window log, the system ensures that a user never exceeds 10 transactions in any rolling 24-hour period, no matter how the transactions are spread out.&lt;/p&gt;
&lt;h2&gt;Rate limiters in the cloud&lt;/h2&gt;
&lt;p&gt;If you’re working with cloud platforms, there’s no need to reinvent the wheel. Both  &lt;strong&gt;AWS&lt;/strong&gt;  and  &lt;strong&gt;Azure&lt;/strong&gt;  offer built-in rate limiting features that are easy to integrate and scale.&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;&lt;strong&gt;AWS API Gateway&lt;/strong&gt;: AWS offers built-in throttling for APIs. You can set a steady-state rate in requests per second together with a burst limit, and use usage plans to apply throttling and longer-term quotas per API key. It also integrates with  &lt;strong&gt;AWS Lambda&lt;/strong&gt;  for more advanced traffic management.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Azure API Management&lt;/strong&gt;: Azure provides  &lt;strong&gt;API Management,&lt;/strong&gt;  which allows you to enforce rate limits and quotas at the API level. You can define policies to throttle requests based on user or IP address, and scale these limits as needed.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;To wrap things up, rate limiting is crucial for maintaining a smooth, fair, and secure system. Whether you’re dealing with burst traffic or protecting your backend from abuse, rate limiting helps you keep things under control. Of course, there are trade-offs; some algorithms are simpler but less flexible, while others offer more precision but come with added complexity. We’ve covered key algorithms like Token Bucket, Leaky Bucket, Fixed Window Counter, and Sliding Window Log, and seen how they fit different use cases. If you’re in the cloud, AWS API Gateway and Azure API Management offer powerful, managed solutions that take care of the heavy lifting. So, choose the right algorithm or service for your needs, and you’ll have a system that handles traffic efficiently and scales with ease. Thanks for reading, and I hope this article has given you the insights you need to tackle rate limiting in your projects.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[The Definitive Guide to Choosing a Storage Solution: Matching Your Data to the Right Architecture]]></title><description><![CDATA[Generated by ChatGPT Introduction: If you’re building any kind of system, whether it’s a web app, a big data analysis dashboard, or an…]]></description><link>https://www.tariqmassaoudi.com/storage-solution/</link><guid isPermaLink="false">https://www.tariqmassaoudi.com/storage-solution/</guid><pubDate>Fri, 09 May 2025 22:12:03 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:1400/format:webp/0*QD_4faOE3SvLSvtG&quot; alt=&quot;Generated by ChatGPT&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Introduction:&lt;/h2&gt;
&lt;p&gt;If you’re building any kind of system, whether it’s a web app, a big data analysis dashboard, or an enterprise backend, you’ll eventually face this question: Where do I store my data?&lt;/p&gt;
&lt;p&gt;This article is a practical guide to navigating that choice. It walks you through how to determine the right storage solution based on your data type, access patterns, and use case. It also highlights real-world tools from cloud providers such as AWS and Azure as examples of how you would implement each solution.&lt;/p&gt;
&lt;h2&gt;What’s the structure of your data?&lt;/h2&gt;
&lt;p&gt;The first and most critical question you must ask is what kind of data are we dealing with?&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;&lt;strong&gt;Structured Data&lt;/strong&gt;: Well-defined rows and columns. Think of tables with strict schemas. e.g., customer info, product inventories, financial transactions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Semi-structured Data&lt;/strong&gt;: JSON, XML, YAML. Has structure but doesn’t fit neatly into columns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unstructured Data&lt;/strong&gt;: Images, videos, audio files, documents. No inherent structure.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;If You Have Structured Data:&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:1400/format:webp/0*FCDEjQdhc3Ov7qSu.png&quot; alt=&quot;OLTP &amp;#x26; OLAP flow&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use Case: OLTP (Online Transaction Processing)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If you’re building a web application or API where users are constantly interacting with the system (logging in, creating accounts, placing orders, updating settings), then you’re working in OLTP territory. These are read/write-heavy operations that need to be fast, reliable, and consistent.&lt;/p&gt;
&lt;p&gt;Think about apps like an e-commerce platform where someone adds items to their cart and checks out, or a social media platform where users update their profiles.&lt;/p&gt;
&lt;p&gt;In this case, it’s recommended to use a &lt;strong&gt;relational database&lt;/strong&gt;. These are great at enforcing structure (schemas), ensuring consistency (ACID compliant), and handling concurrent operations safely.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Common cloud solutions you’d use here&lt;/strong&gt;:&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;&lt;strong&gt;AWS RDS&lt;/strong&gt; (supports MySQL, PostgreSQL, etc.): great for managed production environments&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Azure SQL Database&lt;/strong&gt;: scalable and integrates well if you’re in the Microsoft ecosystem&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Google Cloud SQL&lt;/strong&gt;: pairs nicely with App Engine or GKE for app backends&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use Case: OLAP (Online Analytical Processing):&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When your focus shifts from handling transactions to analyzing them — looking for trends, generating reports, building dashboards — you’re now in OLAP territory. This is where you’re less concerned with updating data and more focused on scanning &lt;strong&gt;large volumes&lt;/strong&gt; of it quickly and efficiently.&lt;/p&gt;
&lt;p&gt;Think about scenarios like: A product manager exploring daily sales by category over the past year or a dashboard that shows real-time KPIs across regions, products, and timeframes.&lt;/p&gt;
&lt;p&gt;These use cases often involve aggregations, filters, and joins on massive datasets. The workloads are &lt;strong&gt;read-heavy&lt;/strong&gt;, and they often run on scheduled pipelines or are triggered by end-user dashboards.&lt;/p&gt;
&lt;p&gt;In this case, it’s recommended to use a &lt;strong&gt;columnar database&lt;/strong&gt;. These are designed specifically for analytical queries — they store data by column rather than by row, which makes operations like filtering and aggregating much faster, especially when only a few fields are queried at a time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Common cloud solutions you’d use here&lt;/strong&gt;:&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;&lt;strong&gt;Amazon Redshift&lt;/strong&gt;: a solid choice for batch-based analytics at scale&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Google BigQuery&lt;/strong&gt;: serverless, fast, and integrates well with other GCP tools like Dataflow or Looker&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Azure Synapse Analytics&lt;/strong&gt;: great if you’re already on Azure and want hybrid support for structured and semi-structured data&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;If You’re Dealing with Semi-Structured Data (JSON, XML, Logs)&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Use case: In-memory caching:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:1400/format:webp/0*ES_OJVl2RPl-1jgT&quot; alt=&quot;Typical caching workflow&quot;&gt;&lt;/p&gt;
&lt;p&gt;Let’s say you’re storing session data, API tokens, or configuration values that need to be accessed frequently and with extremely low latency. In-memory caching is a classic solution for this, especially when you care more about speed than durability.&lt;/p&gt;
&lt;p&gt;In this case, it’s recommended to use an in-memory key–value store. Tools like &lt;strong&gt;Redis&lt;/strong&gt; or &lt;strong&gt;Memcached&lt;/strong&gt; offer very fast access and simple key-based lookup. They’re also widely supported across cloud providers: &lt;strong&gt;Amazon ElastiCache&lt;/strong&gt; and &lt;strong&gt;Azure Cache for Redis&lt;/strong&gt; make setup and scaling relatively seamless.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use Case: Document-Oriented Data Access:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:1332/format:webp/0*1CFToIC8jH76FGNx.png&quot; alt=&quot;Document oriented data&quot;&gt;&lt;/p&gt;
&lt;p&gt;Consider an application where you’re storing complex, nested user profiles, product catalogs, or blog posts that vary in structure. Each object might have subfields, embedded lists, or optional sections. Trying to normalize this into relational tables would not only be tedious but also reduce flexibility and performance.&lt;/p&gt;
&lt;p&gt;In this case, it’s recommended to use a document database. Systems like &lt;strong&gt;MongoDB Atlas&lt;/strong&gt; or &lt;strong&gt;Amazon DocumentDB&lt;/strong&gt; are purpose-built for storing and querying nested JSON objects. They support indexing on nested fields and let you query or update individual paths inside a document. If you’re already serverless or mobile-heavy, &lt;strong&gt;Firebase Firestore&lt;/strong&gt; or &lt;strong&gt;Azure Cosmos DB&lt;/strong&gt; might be a better fit.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use Case: Relationship-based querying:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:1074/format:webp/0*aGpv8hc90OnlXYD-.png&quot; alt=&quot;Graph data structure&quot;&gt;&lt;/p&gt;
&lt;p&gt;Now let’s imagine you’re working on a system where understanding relationships is key. Maybe it’s a social network, a fraud detection tool, or a recommendation engine. Users expect the system to surface meaningful connections across entities: who knows whom, what interacts with what, or how things are related through multiple hops.&lt;/p&gt;
&lt;p&gt;These workloads involve traversals, pattern matching, and recursive relationships, operations that traditional relational databases often struggle to express or optimize.&lt;/p&gt;
&lt;p&gt;In this case, it’s recommended to use a purpose-built graph database. &lt;strong&gt;Neo4j&lt;/strong&gt;, &lt;strong&gt;Amazon Neptune&lt;/strong&gt;, and &lt;strong&gt;Azure Cosmos DB (with Gremlin API)&lt;/strong&gt; are well-suited for modeling and querying complex, interconnected data. They can also power recommendation engines, identity resolution tools, and knowledge graphs with real-time performance.&lt;/p&gt;
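&lt;p&gt;To make "multiple hops" concrete, here is a plain-Python sketch of the kind of traversal a graph database performs for you. It does not use Neo4j, Neptune, or Gremlin. The function name &lt;code&gt;within_hops&lt;/code&gt; and the adjacency-list dictionary are assumptions for illustration only.&lt;/p&gt;

```python
from collections import deque


def within_hops(graph, start, max_hops):
    """Return {node: hop distance} for nodes reachable from start in at most max_hops."""
    distances = {start: 0}
    frontier = deque([start])
    # Breadth-first search visits nodes in order of increasing hop distance.
    while frontier:
        node = frontier.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                frontier.append(neighbor)
    del distances[start]  # the start node is not its own connection
    return distances


# Hypothetical "who knows whom" data as an adjacency list.
friends = {
    "ana": ["ben"],
    "ben": ["ana", "cy"],
    "cy": ["ben", "dee"],
    "dee": ["cy"],
}
friends_of_friends = within_hops(friends, "ana", 2)
```

&lt;p&gt;Expressing the same question in SQL usually takes a self-join per hop or a recursive query, which is why graph databases expose traversals as a first-class operation.&lt;/p&gt;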
&lt;p&gt;&lt;strong&gt;Use Case: Keyword-based text search:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:1400/format:webp/0*Br1QD6Y1GZO5Y6d1.png&quot; alt=&quot;Inverted index for full-text search&quot;&gt;&lt;/p&gt;
&lt;p&gt;Now let’s imagine you’re working on a search experience, maybe for a news website, an internal tool, or a product catalog. Users expect to find what they need quickly, even if they mistype something or use synonyms.&lt;/p&gt;
&lt;p&gt;These workloads involve fuzzy matching, ranking, tokenization, and stemming — all operations that typical databases don’t handle efficiently.&lt;/p&gt;
&lt;p&gt;In this case, it’s recommended to use a dedicated search engine. &lt;strong&gt;Elasticsearch&lt;/strong&gt;, &lt;strong&gt;Amazon OpenSearch Service&lt;/strong&gt;, and &lt;strong&gt;Azure AI Search&lt;/strong&gt; (formerly Azure Cognitive Search) are well-suited for full-text indexing. They can also power advanced filtering and faceted search interfaces.&lt;/p&gt;
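&lt;p&gt;The core data structure behind these engines is the inverted index, which maps each term to the documents that contain it. The toy Python sketch below shows only that idea. The function names &lt;code&gt;tokenize&lt;/code&gt;, &lt;code&gt;build_index&lt;/code&gt;, and &lt;code&gt;search&lt;/code&gt; and the sample documents are made up for the example, and it leaves out stemming, ranking, and fuzzy matching.&lt;/p&gt;

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result = result.intersection(index.get(term, set()))
    return result


# Hypothetical documents for illustration.
docs = {
    1: "Chicken tagine recipe",
    2: "Lamb tagine with prunes",
    3: "Couscous recipe",
}
index = build_index(docs)
```

&lt;p&gt;A lookup touches only the posting sets of the query terms instead of scanning every document, which is what makes keyword search fast at scale.&lt;/p&gt;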
&lt;h2&gt;&lt;strong&gt;If You’re Dealing with Unstructured Data:&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Use Case: File and Media Storage&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:1400/format:webp/0*sUYRyLkDr_IIEhfI.png&quot; alt=&quot;Blob storage structure in Azure&quot;&gt;&lt;/p&gt;
&lt;p&gt;Suppose you’re building a platform that lets users upload profile photos, download PDFs, or stream video and audio. These files don’t need to be interpreted by a database — they just need to be stored, versioned, and served efficiently.&lt;/p&gt;
&lt;p&gt;Think of a learning platform hosting lecture videos, or an HR system storing CVs and scanned contracts. You’re not querying the contents directly, you just want a reliable way to store and retrieve the files.&lt;/p&gt;
&lt;p&gt;In this case, it’s recommended to use object storage. Services like &lt;strong&gt;Amazon S3&lt;/strong&gt;, &lt;strong&gt;Azure Blob Storage&lt;/strong&gt;, and &lt;strong&gt;Google Cloud Storage&lt;/strong&gt; are optimized for durability, availability, and low-cost archival. They also support metadata tagging, version control, and lifecycle policies. Most modern SaaS platforms rely on object storage behind the scenes to manage files at scale.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use Case: Large-Scale Text Analysis and Embedding&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:1400/format:webp/0*jOo2dydaULCtlU7f&quot; alt=&quot;RAG (retrieval augmented generation) workflow&quot;&gt;&lt;/p&gt;
&lt;p&gt;Now imagine you have thousands of documents — emails, customer reviews, support tickets, or legal contracts — and you want to search, classify, or summarize them. These aren’t nicely structured fields; they’re raw text, often messy and long.&lt;/p&gt;
&lt;p&gt;Let’s say your product team wants to analyze sentiment from open-ended feedback, or legal wants to extract entities from scanned documents. This requires semantic understanding, keyword extraction, and often vector search.&lt;/p&gt;
&lt;p&gt;In this case, it’s recommended to preprocess the data into embeddings and store them in a &lt;strong&gt;vector database&lt;/strong&gt;. Tools like &lt;strong&gt;Pinecone&lt;/strong&gt;, &lt;strong&gt;Weaviate&lt;/strong&gt;, and &lt;strong&gt;Qdrant&lt;/strong&gt; support fast similarity search based on meaning rather than keywords. This is the foundation of AI-enhanced search, recommendation engines, and retrieval-augmented generation (RAG) for LLMs.&lt;/p&gt;
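&lt;p&gt;Similarity search boils down to comparing vectors. As a rough sketch, the Python below ranks documents by cosine similarity using brute force. The tiny two-dimensional vectors and the names &lt;code&gt;cosine_similarity&lt;/code&gt; and &lt;code&gt;top_k&lt;/code&gt; are invented for the example. Real embeddings have hundreds or thousands of dimensions, and vector databases generally rely on approximate indexes instead of scanning every vector.&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Made-up embeddings standing in for the output of an embedding model.
vectors = {
    "refund request": [1.0, 0.0],
    "billing question": [0.8, 0.6],
    "tagine recipe": [0.0, 1.0],
}
nearest = top_k([1.0, 0.1], vectors, k=2)
```

&lt;p&gt;In a RAG pipeline, the documents returned by a search like this would be passed to the LLM as context.&lt;/p&gt;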
&lt;p&gt;&lt;strong&gt;Use Case: Data Lake for Large-Scale Analytics&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:1298/format:webp/0*4lTzfoD8UJN6Q8SD.png&quot; alt=&quot;Data Lake Architecture&quot;&gt;&lt;/p&gt;
&lt;p&gt;If you’re storing massive volumes of raw, unstructured data — like logs, images, audio files, or telemetry streams — and want the flexibility to analyze it later, you’re in data lake territory.&lt;/p&gt;
&lt;p&gt;This is different from basic object storage. While object stores like S3 or Azure Blob are great for storing files, a data lake layers on metadata, cataloging, and schema-on-read features, so you can query and process that data at scale.&lt;/p&gt;
&lt;p&gt;Think of use cases like a retailer collecting clickstream data, or a utility company storing IoT sensor feeds. In this case, it’s recommended to use a data lake engine such as &lt;strong&gt;Databricks&lt;/strong&gt;, &lt;strong&gt;Delta Lake&lt;/strong&gt;, or &lt;strong&gt;Snowflake&lt;/strong&gt;, especially if you’re planning downstream analytics, ML workflows, or need compliance features like audit logs and fine-grained access control.&lt;/p&gt;
&lt;h2&gt;Summary flowchart:&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:1400/format:webp/1*3x7Z2upFeRiDP8Mdy92BTg.png&quot; alt=&quot;Summary flowchart with example solutions&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion:&lt;/h2&gt;
&lt;p&gt;There’s no one-size-fits-all storage. It all comes down to how your data looks and how you plan to use it.&lt;/p&gt;
&lt;p&gt;Match your storage to the shape and velocity of your data, and you’ll avoid both overengineering and costly bottlenecks.&lt;/p&gt;
&lt;p&gt;Thank you for reading, and hope this article has been insightful and useful to you.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Caching for Mortals: What You Actually Need to Know]]></title><description><![CDATA[Source: Chatgpt A tasty introduction Imagine you’re building a hot new recipe app that suddenly goes viral because of your revolutionary new…]]></description><link>https://www.tariqmassaoudi.com/caching-guide/</link><guid isPermaLink="false">https://www.tariqmassaoudi.com/caching-guide/</guid><pubDate>Mon, 28 Apr 2025 22:12:03 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:1400/format:webp/0*BozZ4kCAoRCW_bDy&quot; alt=&quot;Source: Chatgpt&quot;&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;A tasty introduction&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Imagine you’re building a hot new recipe app that suddenly goes viral because of your revolutionary new tagine recipe. Your server is now bombarded with requests from thousands of hungry users desperately seeking the perfect tagine. Your database is sweating, your CPU is screaming “I CAN’T HANDLE THIS” and your cloud bill is climbing! Your application has become so slow that users have enough time to prepare couscous while waiting for the page to load.&lt;/p&gt;
&lt;p&gt;Sounds familiar? (Maybe not the food part, but the performance crisis might ring a bell)&lt;/p&gt;
&lt;p&gt;This is where caching enters the chat. Caching is like that efficient friend who remembers everyone’s coffee order so the whole group doesn’t have to recite their complicated requests every single time. In the world of computing, it’s a technique that stores frequently accessed data in a temporary location for quicker retrieval, saving your precious resources from doing the same work over and over again.&lt;/p&gt;
&lt;p&gt;In this article, we’ll break down caching concepts into practical, actionable insights. We’ll explore when to use different caching techniques and how to implement them effectively. Whether you’re a junior developer trying to optimize your first production app or a seasoned engineer wanting to refresh your knowledge, this guide will give you the tools to make informed decisions about caching. So let’s dive in and demystify caching for mere mortals!&lt;/p&gt;
&lt;h2&gt;The Why: Benefits of Caching&lt;/h2&gt;
&lt;p&gt;Imagine if every time someone searched for your popular bastilla recipe, your server had to recalculate the preparation time, re-query the database for ingredients, and recompute the nutritional information. This is very inefficient. It’s like a chef forgetting how to make pizza after every single customer! Caching aims to solve that.&lt;/p&gt;
&lt;p&gt;Here’s what it brings to the table:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fast response time&lt;/strong&gt;: With cached data, your users get their recipe in milliseconds instead of seconds.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dramatic reduction in server load:&lt;/strong&gt; Say your database was previously processing 5,000 identical queries per minute. With caching, that number might drop to around 50.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Significant cost savings:&lt;/strong&gt; Fewer server resources mean lower cloud bills.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Enhanced user experience:&lt;/strong&gt; Widely cited industry studies suggest that many users abandon pages that take more than about 3 seconds to load. Caching helps keep your bounce rate low and your user satisfaction high.&lt;/p&gt;
&lt;h2&gt;Caching Fundamentals: The Building Blocks&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Why it’s effective: Memory vs Disk:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Now that we understand why caching is useful, let’s break down how it works.&lt;/p&gt;
&lt;p&gt;Here’s your app without caching&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:1400/format:webp/1*KFE0uq1pBX4ioEfcdNCp1w.png&quot; alt=&quot;Without caching&quot;&gt;&lt;/p&gt;
&lt;p&gt;Here’s your app with caching enabled&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:1400/format:webp/1*WoB7GH_IsoBKFzB1tDJvWw.png&quot; alt=&quot;With caching&quot;&gt;&lt;/p&gt;
&lt;p&gt;One fundamental point is that caching is fast because memory (RAM) is much faster than disk (HDD or SSD). The tradeoff is that RAM is far more expensive per unit of storage.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:1400/format:webp/1*YkYvnky9gjktKCAckmE2dA.png&quot; alt=&quot;In 2023 Memory (RAM) is still 50x more expensive per Terabyte than SSD&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cache Hit vs. Cache Miss:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Let’s continue with how a lookup works.&lt;/p&gt;
&lt;p&gt;When your application looks for data in the cache, one of two things happens:&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;&lt;strong&gt;Cache Hit&lt;/strong&gt;: “Eureka! Found it!” Your app found what it needed in the cache. This is the equivalent of finding your keys exactly where you left them. The data is served immediately, and everyone’s happy.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cache Miss&lt;/strong&gt;: “Uh-oh, not here.” The data isn’t in the cache, so your app has to take the scenic route to the database, fetch the data, and return it. It typically also stores the result in the cache so that the next request for the same data is a cache hit.&lt;/li&gt;
&lt;/ul&gt;
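&lt;p&gt;The hit-or-miss flow above can be sketched in a few lines of Python. This is a minimal illustration rather than a production cache: the dictionary standing in for the cache, the fake_database dictionary, the stats counter, and the get_recipe helper are all hypothetical names made up for this example.&lt;/p&gt;

```python
# Hypothetical stand-ins: one dict plays the cache, another plays the database.
cache = {}
fake_database = {"tagine": "Slow-cooked lamb with apricots"}
stats = {"hits": 0, "misses": 0}


def get_recipe(name):
    """Return a recipe, checking the cache before the slower data source."""
    if name in cache:
        # Cache hit: serve straight from memory.
        stats["hits"] += 1
        return cache[name]
    # Cache miss: fall back to the source of truth...
    stats["misses"] += 1
    value = fake_database.get(name)
    if value is not None:
        # ...and populate the cache so the next lookup is a hit.
        cache[name] = value
    return value


first = get_recipe("tagine")   # miss: fetched from the database, then cached
second = get_recipe("tagine")  # hit: served from the cache
```

&lt;p&gt;Reading through the cache first and filling it on a miss is often called the cache-aside pattern. A real setup would swap the dictionaries for an actual cache and database client.&lt;/p&gt;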
&lt;p&gt;&lt;strong&gt;Cache Eviction: Making Room for New Stuff&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Just like that milk that was perfect yesterday but is questionable today, cached data has a shelf life. Here are some strategies that you can apply when the cache is full.&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;&lt;strong&gt;Least Recently Used (LRU)&lt;/strong&gt;: “Haven’t used that recipe in weeks? Out it goes.” Discards the least recently accessed items first when the cache is full.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Least Frequently Used (LFU)&lt;/strong&gt;: “Nobody’s looking at the kale recipes anymore.” Tracks popularity and dumps the least frequently accessed items.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;First In, First Out (FIFO)&lt;/strong&gt;: “Oldest items exit first.” Simple but doesn’t account for item popularity.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cache Invalidation: The Hard Part&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There’s a famous quote in computer science, usually attributed to Phil Karlton: “There are only two hard things in Computer Science: cache invalidation and naming things.” Here are some popular strategies along with their use cases:&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;&lt;strong&gt;Time-Based Expiration (TTL): “If it’s been here too long, it’s probably bad.”&lt;/strong&gt;
You put a timer on your cache entries. Once the clock runs out, they’re tossed. This works great for stuff like API responses or session tokens, where being a little out of date isn’t the end of the world. It’s super simple to set up: just tell the cache how long to keep things. The downside? You might end up serving stale data if something important changes before the timer runs out.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write-Through Cache: “If it’s important enough to save, it’s important enough to update.”&lt;/strong&gt;
Every time something gets written to your database, it also gets written to the cache right away. This keeps the two perfectly in sync, making it perfect for things like shopping carts or user profiles, where you want instant consistency. The catch is that it slows down your writes, because now you’re hitting two systems at once. Plus, if your cache ever goes down, you’re in for a bad day.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write-Around Cache: “Save it quietly. We’ll deal with it later.”&lt;/strong&gt;
When you update something, you skip the cache entirely and just hit the database. The cache only gets involved when someone tries to read the data later. This is great for write-heavy systems like logging apps, where most of the stuff written never gets looked at again. It keeps your cache cleaner, but the first read after a write is slower, because the cache has to scramble to catch up.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write-Back (Write-Behind) Cache: “We’ll get to it… eventually.”&lt;/strong&gt;
Instead of writing to the database right away, you dump the data into the cache first and let the cache figure out when to push it back to the database. This makes writes lightning-fast, which is ideal for things like collecting sensor data or heavy logging. Just be warned: if your cache crashes before syncing back to the database, your data could vanish into the void.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manual Invalidation: “You break it, you clean it.”&lt;/strong&gt;
When your app knows that something has changed, it takes responsibility and manually deletes or updates the cache entry. This is the go-to strategy for precision-demanding systems like content management platforms and real-time dashboards. It keeps your cache correct only as long as every code path that changes the data also invalidates the matching entry, which means you need tight, careful code.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Event-Based (Pub-Sub) Invalidation: “Spread the word: it’s outdated!”&lt;/strong&gt;
Instead of manually trying to keep caches updated, you set up a system where any change to the data fires off an event. All the caches that care about that piece of data listen for the event and update themselves accordingly. This keeps things snappy and coordinated across huge distributed systems. Of course, now you have to run and monitor an event system, which can get complicated fast.&lt;/li&gt;
&lt;/ul&gt;
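&lt;p&gt;To make the first strategy in the list concrete, here is a small sketch of time-based expiration. It is a toy under stated assumptions, not how any particular cache implements TTL: the TTLCache class, its method names, and the injectable clock parameter are all hypothetical and exist only for this example.&lt;/p&gt;

```python
import time


class TTLCache:
    """Toy time-based expiration: entries are dropped once their timer runs out."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock      # injectable so the demo needs no real waiting
        self._entries = {}      # key -> (value, expiry timestamp)

    def set(self, key, value):
        self._entries[key] = (value, self.clock() + self.ttl_seconds)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self.clock() >= expires_at:
            # Expired: remove the stale entry and report a miss.
            del self._entries[key]
            return default
        return value


# Demo with a fake clock that we advance by hand.
now = [0.0]
recipes = TTLCache(ttl_seconds=60, clock=lambda: now[0])
recipes.set("bastilla", "cached nutrition info")
fresh = recipes.get("bastilla")    # still inside the 60 second window
now[0] = 61.0
expired = recipes.get("bastilla")  # timer ran out, so this is a miss
```

&lt;p&gt;The sketch also shows the staleness tradeoff: until the 60 seconds pass, the cache keeps returning the old value even if the underlying data has changed.&lt;/p&gt;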
&lt;h2&gt;Deep dive into the LRU algorithm&lt;/h2&gt;
&lt;p&gt;Next, let’s look at one of the most popular eviction algorithms, LRU, from an architectural standpoint.&lt;/p&gt;
&lt;p&gt;LRU (Least Recently Used) is pretty human if you think about it: if you don’t interact with a person for a long time, you tend to forget them. That’s basically how it works. When the cache is full, discard the least recently accessed item.&lt;/p&gt;
&lt;p&gt;To implement LRU effectively, we need two key operations to be fast:&lt;/p&gt;
&lt;ol class=&quot;list-disc&quot;&gt;
&lt;li&gt;Retrieving an item by its key&lt;/li&gt;
&lt;li&gt;Tracking and updating the “recently used” order&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This creates an interesting challenge: Hash tables are great for key-based lookups (O(1) time) but don’t maintain order. Linked lists are perfect for maintaining and modifying order but terrible for lookups.&lt;/p&gt;
&lt;p&gt;The solution? A hybrid approach using both:&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;A hash map (dictionary) for O(1) lookups&lt;/li&gt;
&lt;li&gt;A doubly-linked list for tracking access order. The &lt;strong&gt;head&lt;/strong&gt; of the list represents the most recently used item, while the &lt;strong&gt;tail&lt;/strong&gt; represents the least recently used item.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:1400/format:webp/0*_THEsXMyrEhWzeyQ.png&quot; alt=&quot;LRU cache data structures&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Process Flow&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Accessing Data&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When a cache entry is accessed, it gets moved to the &lt;strong&gt;most recently used&lt;/strong&gt; position in the list (the head).&lt;/p&gt;
&lt;p&gt;This ensures that the most recently accessed items stay at the front, and the least recently used items drift toward the back.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Adding Data&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When adding a new entry, it’s inserted at the &lt;strong&gt;most recently used&lt;/strong&gt; position (the head).&lt;/p&gt;
&lt;p&gt;If the cache has reached its &lt;strong&gt;capacity&lt;/strong&gt;, the &lt;strong&gt;least recently used&lt;/strong&gt; entry (the tail of the list) is evicted to make space.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:1400/format:webp/0*cRMvpHgkjYwK4tsO.jpeg&quot; alt=&quot;Example process flow&quot;&gt;&lt;/p&gt;
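&lt;p&gt;The process flow above can be sketched with exactly those two structures: a dictionary for lookups and a doubly-linked list for recency order. This is a minimal, single-threaded illustration that assumes a capacity of at least 1. The Node and LRUCache classes and the keys_by_recency helper are hypothetical names chosen for this example.&lt;/p&gt;

```python
class Node:
    """One cache entry in the doubly-linked list."""

    def __init__(self, key=None, value=None):
        self.key = key
        self.value = value
        self.prev = None
        self.next = None


class LRUCache:
    """Hash map for O(1) lookups plus a doubly-linked list for recency order.

    Assumes capacity is at least 1 and that only one thread uses the cache.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.map = {}            # key -> Node
        self.head = Node()       # sentinel: head.next is the most recent entry
        self.tail = Node()       # sentinel: tail.prev is the least recent entry
        self.head.next = self.tail
        self.tail.prev = self.head

    def _unlink(self, node):
        node.prev.next = node.next
        node.next.prev = node.prev

    def _push_front(self, node):
        node.prev = self.head
        node.next = self.head.next
        self.head.next.prev = node
        self.head.next = node

    def get(self, key, default=None):
        node = self.map.get(key)
        if node is None:
            return default       # cache miss
        self._unlink(node)       # accessed, so move it to the head
        self._push_front(node)
        return node.value

    def put(self, key, value):
        node = self.map.get(key)
        if node is not None:
            node.value = value   # existing key: update and refresh recency
            self._unlink(node)
            self._push_front(node)
            return
        if len(self.map) >= self.capacity:
            lru = self.tail.prev  # least recently used entry
            self._unlink(lru)
            del self.map[lru.key]
        node = Node(key, value)
        self.map[key] = node
        self._push_front(node)

    def keys_by_recency(self):
        """Keys from most to least recently used (helper for the demo)."""
        keys, node = [], self.head.next
        while node is not self.tail:
            keys.append(node.key)
            node = node.next
        return keys


cache = LRUCache(capacity=2)
cache.put("tagine", 1)
cache.put("bastilla", 2)
cache.get("tagine")        # tagine becomes the most recently used entry
cache.put("couscous", 3)   # cache is full, so bastilla (the tail) is evicted
```

&lt;p&gt;In everyday Python you rarely need to write the linked list yourself: &lt;code class=&quot;language-text&quot;&gt;collections.OrderedDict&lt;/code&gt; (with &lt;code class=&quot;language-text&quot;&gt;move_to_end&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;popitem&lt;/code&gt;) or the &lt;code class=&quot;language-text&quot;&gt;functools.lru_cache&lt;/code&gt; decorator cover the same idea. The explicit version is shown here only because it mirrors the diagram.&lt;/p&gt;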
&lt;h2&gt;Cache in the Real World: Redis and Memcached&lt;/h2&gt;
&lt;p&gt;In real production systems, you’re not usually hand-building your cache from scratch. Instead, you lean on powerful, battle-tested tools like &lt;strong&gt;Redis&lt;/strong&gt; or &lt;strong&gt;Memcached&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Both Redis and Memcached are &lt;strong&gt;in-memory key-value stores&lt;/strong&gt; used for caching, but they have slightly different philosophies:&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;&lt;strong&gt;Memcached&lt;/strong&gt; is a lightweight, pure caching layer. Think: simple key-value, no persistence, no rich data structures.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redis&lt;/strong&gt; is an in-memory data structure store — it can cache, but it can also persist to disk, replicate data, and even act like a mini-database.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;To sum up, caching reduces load and latency by keeping key data in fast memory instead of repeatedly hitting slower backends. Of course, caching is all about trade‑offs — you gain speed and cost savings at the expense of added complexity, memory use, and potential data staleness. We’ve covered cache hits versus misses, eviction policies (LRU, LFU, FIFO), invalidation methods (TTL, write‑through, pub‑sub), and real‑world tools like Redis and Memcached. Start by caching your heaviest queries with a simple cache‑aside pattern, then measure and refine for optimal performance. Thank you for reading, and hope this article has been insightful and useful to you.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[I Read AI Engineering by Chip Huyen — Here's What Stuck With Me]]></title><description><![CDATA[I’ve been building AI features into production systems for a while now. Like most engineers in this space, I picked things up as I went — a…]]></description><link>https://www.tariqmassaoudi.com/ai-engineering-book-review/</link><guid isPermaLink="false">https://www.tariqmassaoudi.com/ai-engineering-book-review/</guid><pubDate>Thu, 20 Feb 2025 12:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I’ve been building AI features into production systems for a while now. Like most engineers in this space, I picked things up as I went — a blog post here, a YouTube tutorial there, a lot of trial and error. It worked, but I never had a clear mental model of the full picture. I knew pieces, not the system.&lt;/p&gt;
&lt;p&gt;Then I picked up &lt;strong&gt;AI Engineering: Building Applications with Foundation Models&lt;/strong&gt; by Chip Huyen (O’Reilly, 2025). I wish I had read it a year earlier. Not because it taught me entirely new things — some of it I already knew from experience — but because it organized everything into a framework that finally made sense. It connected the dots between evaluation, prompt engineering, RAG, agents, finetuning, and production architecture in a way no blog post ever did.&lt;/p&gt;
&lt;p&gt;Here are the ideas from the book that changed how I think about building AI applications.&lt;/p&gt;
&lt;h2&gt;AI Engineering Is Not ML Engineering&lt;/h2&gt;
&lt;p&gt;This distinction seems obvious in hindsight, but the book makes it explicit. Traditional ML engineering is about collecting data, training models, and deploying them. You own the entire pipeline from data to weights. AI engineering is different: you’re building on top of foundation models that someone else trained. Your job shifts from model creation to &lt;strong&gt;model adaptation&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In ML engineering, the competitive advantage was in data labeling, feature engineering, and model architecture. In AI engineering, everyone has access to the same models through APIs. The moat is in &lt;strong&gt;context engineering&lt;/strong&gt;, &lt;strong&gt;evaluation on your own use case&lt;/strong&gt;, and &lt;strong&gt;user experience&lt;/strong&gt;. The question becomes: how do I get the best results out of these models for my specific problem?&lt;/p&gt;
&lt;p&gt;Huyen breaks the AI stack into three layers: application development (prompts, context, UX), model development (finetuning, dataset engineering), and infrastructure (serving, compute, monitoring). Most of the work happens at the top layer. You start there and only move down when you need to.&lt;/p&gt;
&lt;h2&gt;Evaluation Is Everything&lt;/h2&gt;
&lt;p&gt;If there’s one theme that runs through the entire book, it’s this: &lt;strong&gt;evaluation is everything&lt;/strong&gt;. And it’s the part most teams get wrong or skip entirely.&lt;/p&gt;
&lt;p&gt;Evaluating traditional software is straightforward — either the function returns the expected output or it doesn’t. Evaluating an LLM’s output is messy. The output is open-ended, subjective, and probabilistic. The model might give a different answer each time.&lt;/p&gt;
&lt;p&gt;The book introduces &lt;strong&gt;Evaluation-Driven Development (EDD)&lt;/strong&gt;, inspired by TDD. The idea is to define your evaluation criteria before you start building. What does a good response look like? What does a bad one look like? Write rubrics, create scoring guidelines, provide examples. If you don’t do this, you’re basically guessing whether your system is getting better or worse with each change.&lt;/p&gt;
&lt;p&gt;In practice there’s a spectrum of evaluation methods. &lt;strong&gt;Functional correctness&lt;/strong&gt; is the gold standard when you can use it — if you’re generating code, run it against test cases. &lt;strong&gt;Similarity to references&lt;/strong&gt; works when you have ground truth, using lexical overlap (BLEU, ROUGE) or semantic similarity via embeddings. &lt;strong&gt;LLM-as-a-judge&lt;/strong&gt; is increasingly popular for subjective evaluation — you use a strong model to score the output of another model. It’s scalable but comes with real limitations: self-bias, position bias, and verbosity bias. Despite those flaws, it’s still useful when combined with other methods.&lt;/p&gt;
&lt;p&gt;The practical takeaway: define your evaluation criteria before you write a single prompt. If you care about something — factuality, tone, format, safety — put an evaluation on it.&lt;/p&gt;
&lt;h2&gt;RAG: Facts vs Form&lt;/h2&gt;
&lt;p&gt;RAG (Retrieval-Augmented Generation) gets its own deep treatment in the book, and rightfully so. The core idea is simple: before the model generates a response, retrieve relevant information and include it in the context.&lt;/p&gt;
&lt;p&gt;Some people think that as context windows grow longer (200K+ tokens now), RAG will become unnecessary. Huyen argues the opposite, and I agree: &lt;strong&gt;data always grows faster than context windows&lt;/strong&gt;. You’ll never fit everything into context, so you’ll always need intelligent retrieval.&lt;/p&gt;
&lt;p&gt;The phrase from the book I use all the time now: &lt;strong&gt;“RAG is for facts, finetuning is for form.”&lt;/strong&gt; If your model needs to know specific, up-to-date information — use RAG. If your model needs to adopt a specific style or behavior pattern — consider finetuning. Most applications need RAG first. Finetuning is expensive, can become outdated when the base model updates, and should only be pursued after you’ve maximized what prompting and RAG can do.&lt;/p&gt;
&lt;h2&gt;Agents Are Powerful but Fragile&lt;/h2&gt;
&lt;p&gt;The agents chapter is where the book gets exciting. At its core, an agent is just an LLM that can perceive its environment and act on it through tools. ChatGPT browsing the web, a coding agent running terminal commands, a customer support bot querying a database — these are all agents.&lt;/p&gt;
&lt;p&gt;A key principle that maps directly to my experience with coding agents: &lt;strong&gt;decouple planning from execution&lt;/strong&gt;. Let the model generate a plan first, validate that plan, then execute it step by step. Blindly letting a model plan and execute simultaneously is how you get agents that go off the rails. This is the same pattern I follow daily — I always ask for a plan first, review it, then let the agent implement.&lt;/p&gt;
&lt;p&gt;Here’s the math that makes this concrete: each step in an agent’s plan is a potential point of failure, and errors compound. A five-step plan where each step has 90% accuracy gives you only about &lt;strong&gt;59% overall success&lt;/strong&gt;. This is why keeping agent plans simple and providing verification at each step matters so much.&lt;/p&gt;
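&lt;p&gt;The arithmetic behind that number is easy to reproduce. The sketch below assumes that every step succeeds or fails independently with the same accuracy, which is a simplification. The plan_success_rate function is a hypothetical helper written for this example, not something from the book.&lt;/p&gt;

```python
def plan_success_rate(step_accuracy, steps):
    """Chance that every step succeeds, assuming steps fail independently."""
    return step_accuracy ** steps


five_steps = plan_success_rate(0.9, 5)    # 0.9 ** 5, roughly 0.59
ten_steps = plan_success_rate(0.9, 10)    # 0.9 ** 10, roughly 0.35
print(f"5 steps: {five_steps:.0%}, 10 steps: {ten_steps:.0%}")
```

&lt;p&gt;Doubling the plan from five to ten steps drops the expected success rate from about 59% to about 35%, which is the compounding effect in action.&lt;/p&gt;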
&lt;p&gt;The book also covers multi-agent patterns — routers that classify and delegate queries, sequential chains where each agent processes the previous output, supervisor agents that orchestrate sub-agents, and parallel execution for independent subtasks. These are worth knowing, but the core lesson is simpler: &lt;strong&gt;more steps = more failure points&lt;/strong&gt;. Keep it tight.&lt;/p&gt;
&lt;h2&gt;The Data Flywheel&lt;/h2&gt;
&lt;p&gt;One framework from the book I keep thinking about is around competitive advantage. When building AI products, the barrier to entry is low. If it’s easy for you to build something with an API, it’s easy for anyone else too.&lt;/p&gt;
&lt;p&gt;Huyen identifies three potential moats: technology, data, and distribution. With foundation models commoditizing the technology layer and big companies owning distribution, the most sustainable moat for most teams is &lt;strong&gt;data&lt;/strong&gt;. Specifically, the feedback loop: ship fast, collect user interactions, use that data to improve the product, attract more users, collect more data. This flywheel is what separates products that keep getting better from those that stagnate.&lt;/p&gt;
&lt;p&gt;This means your feedback collection design matters enormously. Explicit feedback (thumbs up/down) is sparse and biased. Implicit feedback (conversation continuation, task completion, abandonment) is noisy but abundant. Designing how you extract signal from user interactions is an underrated skill.&lt;/p&gt;
&lt;h2&gt;What Stuck With Me&lt;/h2&gt;
&lt;p&gt;After reading the book and continuing to build AI features in production, here are the frameworks that stuck:&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Evaluation first.&lt;/strong&gt; Before I write prompts, I write evaluation criteria. Before I change a model or a pipeline component, I make sure I can measure whether the change is an improvement.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;RAG before finetuning.&lt;/strong&gt; Every time someone suggests finetuning, I ask: have we exhausted what we can do with better retrieval and better prompts? The answer is almost always no.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Start simple, add progressively.&lt;/strong&gt; Don’t try to build the perfect system from day one. Start with a good prompt and RAG. Evaluate. Then add complexity where the metrics tell you to.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Agents are powerful but fragile.&lt;/strong&gt; The more steps in your agent’s plan, the more points of failure. Decouple planning from execution. Verify at each step.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Context engineering is the skill.&lt;/strong&gt; Not prompting. Context engineering. That includes what information you retrieve, how you structure it, what goes at the beginning vs. the middle, and how much you include. This is where the craft is.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you’re building anything with foundation models, this book is the best single resource I’ve found. The specific tools and models will change, but the principles are durable. Read it, then re-read the evaluation chapters, then go build your eval pipeline.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[I Use Coding Agents Daily: Here's What Works]]></title><description><![CDATA[Introduction: Agentic coding is less about “letting the AI code” and more about how you set it up for success. Treat coding agents like…]]></description><link>https://www.tariqmassaoudi.com/i-use-coding-agents-daily/</link><guid isPermaLink="false">https://www.tariqmassaoudi.com/i-use-coding-agents-daily/</guid><pubDate>Wed, 22 Jan 2025 12:00:00 GMT</pubDate><content:encoded>&lt;h3&gt;Introduction:&lt;/h3&gt;
&lt;p&gt;Agentic coding is less about “letting the AI code” and more about how you set it up for success. Treat coding agents like &lt;strong&gt;junior engineers&lt;/strong&gt;: give them clear goals, strong constraints, the right tools, and a way to validate their work. This article summarizes practical lessons and patterns that have worked for me when using modern agentic coding tools in real projects.&lt;/p&gt;
&lt;h3&gt;Variables that affect the quality of the output:&lt;/h3&gt;
&lt;p&gt;When you’re interacting with a modern coding agent, you can choose the &lt;strong&gt;underlying model&lt;/strong&gt;, the &lt;strong&gt;content of the message&lt;/strong&gt; you send the agent (its initial context), and the &lt;strong&gt;tools&lt;/strong&gt; provided to the agent. Each of these variables affects the output.&lt;/p&gt;
&lt;h3&gt;Put the most effort in planning:&lt;/h3&gt;
&lt;p&gt;For most agentic tasks, with the exception of trivial, clear-cut bug fixes or documentation, I recommend spending the most time and effort on crafting a clear plan for the agent before implementation. Most agentic tools offer a plan mode that you can use. For complex problems where I’m not sure about the right solution, I like to start with an exploratory or brainstorming prompt. It’s important to give the agent a clear path to the solution when possible so that it doesn’t guess. Here’s an example of an exploratory prompt structure I like to use:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;As a (domain expert)
Given this problem:
(your problem)
propose multiple solutions that respect (your best practices or constraints here)
Rank these solutions while providing detailed reasoning and tradeoffs.
Recommend the best solution
(tag the relevant files or folders here)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Context is king:&lt;/h3&gt;
&lt;p&gt;Just like humans, coding agents perform best when they have the right information, and they get less smart as their context fills up. Current models have a context window of around 200K tokens. For reference, see this &lt;a href=&quot;https://research.trychroma.com/context-rot&quot;&gt;study&lt;/a&gt; by Chroma.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/1600/0*8_lA6qnrmo8nAnT7.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Context engineering is a very important skill for getting the best out of coding agents. Here are some tips that have worked for me:&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;&lt;strong&gt;One task, one session&lt;/strong&gt;: after each task, start a new chat.&lt;/li&gt;
&lt;li&gt;If the feature is too big, ask the agent to &lt;strong&gt;split the plan into phases&lt;/strong&gt;. Execute each phase in a new session and verify its output before moving to the next phase.&lt;/li&gt;
&lt;li&gt;When running out of context, ask the agent to create a &lt;strong&gt;handover markdown document&lt;/strong&gt; with the work done and the lessons learned, then pass it to another agent to continue the work.&lt;/li&gt;
&lt;li&gt;When possible, provide the agent with the &lt;strong&gt;exact files relevant to the task&lt;/strong&gt; so it doesn’t waste time and tokens exploring the codebase.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Give the agent a way to verify its work:&lt;/h3&gt;
&lt;p&gt;Without a way to verify its work, the agent is basically guessing: it might one-shot your task, or you might have to verify its work manually and iterate with it. If you give it a deterministic way to verify the work, it will guess, verify, and, if wrong, rethink the approach until the task is correct.&lt;/p&gt;
&lt;p&gt;In practice, ask the agent to &lt;strong&gt;write tests&lt;/strong&gt; and verify the code against them. For backend work, I’ve found it effective to ask the agent to run the backend server and test the endpoint live.&lt;/p&gt;
&lt;p&gt;Frontend tasks are harder to verify. You can use the Playwright MCP or the Claude Chrome extension, but they can be unreliable. The next best thing is to ask the agent to add debug logs, then copy them back to the agent if something goes wrong.&lt;/p&gt;
&lt;h3&gt;The right model for the right task:&lt;/h3&gt;
&lt;p&gt;For planning, I always use the current best model, which is Claude Opus 4.5; some people have had success with GPT 5.2. For executing the plan, the next tier of models, such as Claude Sonnet, is often enough as long as the plan is detailed. For simple tasks such as committing or writing pull requests, you can choose the smallest, fastest model, for example Claude Haiku or Gemini Flash.&lt;/p&gt;
&lt;h3&gt;When to use MCP:&lt;/h3&gt;
&lt;p&gt;The drawback of using MCPs is the &lt;strong&gt;context cost&lt;/strong&gt;: they store their tool descriptions in context, and you have to remember to disable the MCP server after use. If the service you’re interacting with provides a CLI tool that accomplishes the same task as the MCP (for example, the GitHub CLI or the Azure CLI), just ask the model to &lt;strong&gt;use the CLI instead&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;Slash commands:&lt;/h3&gt;
&lt;p&gt;Slash commands are shortcut prompts for common tasks, and they’re extremely useful. I mainly use them for committing, pushing, and creating pull requests. Here’s an example command to commit and push:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;1. First, run git diff to see all changes (both staged and unstaged)
2. Analyze the diff to understand what changed
3. Write a conventional commit message based on the diff:
   - Use format: type(scope): description
   - Types: feat, fix, docs, style, refactor, test, chore
   - Keep the first line under 72 characters
   - Add a blank line and bullet points for details if needed
4. Stage all changes with git add -A
5. Commit with the conventional commit message
6. Push to the remote branch. If the branch has no upstream, set it with
   git push -u origin &amp;lt;branch&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Global rules files&lt;/h3&gt;
&lt;p&gt;Rules files such as &lt;strong&gt;claude.md&lt;/strong&gt; and &lt;strong&gt;cursor rules&lt;/strong&gt; are a must-have for establishing your dos and don’ts, coding style, and so on. They help constrain the agent, but they’re not hard rules: expect the agent to ignore them sometimes. Here’s a &lt;a href=&quot;https://cursor.directory/rules&quot;&gt;resource&lt;/a&gt; to find common rules for your stack.&lt;/p&gt;
&lt;p&gt;You must &lt;strong&gt;review the agent output manually&lt;/strong&gt;. Another helpful pattern is to give a second agent your quality metrics and have it review the output of the first agent. This helps you quickly find any red flags.&lt;/p&gt;
&lt;h3&gt;Conclusion:&lt;/h3&gt;
&lt;p&gt;Agentic coding works when you treat it like managing a junior dev: clear tasks, good context, and proper verification. The fundamentals won’t change as tools evolve, &lt;strong&gt;planning matters more than prompting&lt;/strong&gt;, &lt;strong&gt;context engineering beats brute force&lt;/strong&gt;, and &lt;strong&gt;review is non-negotiable&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Start small, build your own patterns, and remember: you’re still the engineer. The agent just moves faster than you type.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[From Hacky Scripts to Professional Code: A Guide to Crafting High-Quality Python Projects]]></title><description><![CDATA[Introduction: Imagine it’s late at night and you’re working on a python script that just has to work. It started off as a simple idea, just…]]></description><link>https://www.tariqmassaoudi.com/hacky-scripts/</link><guid isPermaLink="false">https://www.tariqmassaoudi.com/hacky-scripts/</guid><pubDate>Fri, 04 Oct 2024 22:12:03 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:875/1*VHR5EQANK5eg2K_NP0PRfQ.jpeg&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Introduction:&lt;/h2&gt;
&lt;p&gt;Imagine it’s late at night and you’re working on a Python script that just has to work. It started off as a simple idea: just automate this one thing, scrape this piece of data, and you’re done! As you intuitively add more features, a few extra lines turn into a hundred, and before you know it the script has grown into an unmanageable mess: a tangle of dependencies, random formatting, and small changes that risk breaking everything else.&lt;/p&gt;
&lt;p&gt;Sound familiar?&lt;/p&gt;
&lt;p&gt;This scenario plays out for developers across the world, whether they’re just starting out with Python or juggling multiple projects that evolved without proper structure. Hence the need to start your project off right and create something maintainable, shareable, and scalable that others can easily work on!&lt;/p&gt;
&lt;p&gt;In this article, we’re going to explore how a few key tools and approaches can elevate your Python projects to a professional standard:  &lt;strong&gt;automatic code formatting&lt;/strong&gt;  with Black,  &lt;strong&gt;code linting&lt;/strong&gt;  to ensure quality,  &lt;strong&gt;dependency management&lt;/strong&gt;  using Poetry, and the  &lt;strong&gt;power of Makefiles&lt;/strong&gt;  to simplify everyday tasks.&lt;/p&gt;
&lt;h2&gt;How to Make the Best Use of This Article 📋&lt;/h2&gt;
&lt;p&gt;Think of this article as a &lt;strong&gt;checklist&lt;/strong&gt; for improving your Python projects, covering everything from dependency management to automated testing.&lt;/p&gt;
&lt;p&gt;The article provides external resources to dive deeper into each tool or topic.&lt;/p&gt;
&lt;h3&gt;What It Is:&lt;/h3&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;A practical guide for leveling up your Python projects.&lt;/li&gt;
&lt;li&gt;A starting point for tools that streamline development.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;What It’s Not:&lt;/h3&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;A deep dive into each tool’s advanced features.&lt;/li&gt;
&lt;li&gt;A one-size-fits-all solution. You don’t need every tool, so please adapt it to your needs!&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Dependency Management Made Easy with Poetry 🛠️&lt;/h2&gt;
&lt;p&gt;If you’ve ever worked with  &lt;code class=&quot;language-text&quot;&gt;pip&lt;/code&gt;  and  &lt;code class=&quot;language-text&quot;&gt;requirements.txt&lt;/code&gt;, you’ve likely run into issues like version conflicts, missing packages, or struggles to replicate environments. Poetry solves these problems by maintaining a  &lt;strong&gt;single source of truth&lt;/strong&gt;  for your project’s dependencies using the  &lt;code class=&quot;language-text&quot;&gt;pyproject.toml&lt;/code&gt;  file, making it easier to:&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Install dependencies consistently across machines.&lt;/li&gt;
&lt;li&gt;Manage both development and production dependencies.&lt;/li&gt;
&lt;li&gt;Keep your project reproducible by pinning exact versions in  &lt;code class=&quot;language-text&quot;&gt;poetry.lock&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Getting Started with Poetry&lt;/h3&gt;
&lt;p&gt;Install Poetry&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;curl&lt;/span&gt; -sSL https://install.python-poetry.org &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; python3 -&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Initialize Your Project&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;poetry init&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This command walks you through setting up your  &lt;code class=&quot;language-text&quot;&gt;pyproject.toml&lt;/code&gt;, where your project’s metadata and dependencies are declared.&lt;/p&gt;
&lt;p&gt;Add Dependencies:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;poetry &lt;span class=&quot;token function&quot;&gt;add&lt;/span&gt; fastapi&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This installs FastAPI and updates your  &lt;code class=&quot;language-text&quot;&gt;pyproject.toml&lt;/code&gt;  and  &lt;code class=&quot;language-text&quot;&gt;poetry.lock&lt;/code&gt;. For development dependencies like linters or testing tools, add them to a dependency group. On Poetry 1.2 and later, which deprecated the older  &lt;code class=&quot;language-text&quot;&gt;--dev&lt;/code&gt;  flag, use:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;poetry &lt;span class=&quot;token function&quot;&gt;add&lt;/span&gt; --group dev black&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can install all dependencies of a particular project using:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;poetry &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
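&lt;p&gt;This installs the exact versions recorded in  &lt;code class=&quot;language-text&quot;&gt;poetry.lock&lt;/code&gt;. If the lock file does not exist yet, Poetry resolves the dependencies from  &lt;code class=&quot;language-text&quot;&gt;pyproject.toml&lt;/code&gt;  and creates it.&lt;/p&gt;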
&lt;h2&gt;Virtual Environments: The Power of Isolation 🐍&lt;/h2&gt;
&lt;p&gt;If you’ve ever juggled multiple Python projects, each requiring different libraries or even different versions of Python, you’ve probably run into dependency conflicts or broken global installations.&lt;/p&gt;
&lt;p&gt;This is where  &lt;strong&gt;virtual environments&lt;/strong&gt;  become a developer’s best friend — they allow each project to have its own isolated setup, free from the chaos of conflicting versions.&lt;/p&gt;
&lt;h3&gt;Pyenv: A Solution for Managing Multiple Python Versions&lt;/h3&gt;
&lt;p&gt;Pyenv allows you to install and switch between different Python versions effortlessly, right from your terminal.&lt;/p&gt;
&lt;h3&gt;Example Scenario with Pyenv:&lt;/h3&gt;
&lt;p&gt;Imagine you’re working on a new project that needs Python 3.10 for its features, but you have another project stuck on Python 3.8. Let’s solve this issue with Pyenv.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Install Pyenv&lt;/strong&gt;: First, install Pyenv with the installer script below (Linux and macOS), then follow the instructions it prints to add Pyenv to your shell configuration:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;curl&lt;/span&gt; https://pyenv.run &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;bash&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Install Multiple Python Versions&lt;/strong&gt;: Use Pyenv to install Python 3.8 and Python 3.10:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;pyenv &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;3.10&lt;/span&gt;.0  
pyenv &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;3.8&lt;/span&gt;.10&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Switching Between Versions&lt;/strong&gt;: To set Python 3.10 globally, run:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;pyenv global &lt;span class=&quot;token number&quot;&gt;3.10&lt;/span&gt;.0&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
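&lt;p&gt;The global setting alone does not cover the project that is still on Python 3.8. A minimal sketch of one way to handle it, assuming that project lives in a folder called  &lt;code class=&quot;language-text&quot;&gt;legacy-project&lt;/code&gt;  (a hypothetical name used only for illustration) and that the Pyenv shell setup is active:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;# legacy-project is a hypothetical folder name for the project that still needs Python 3.8
cd legacy-project
# Writes a .python-version file in this folder; Pyenv selects 3.8.10 whenever you work here
pyenv local 3.8.10
# Confirm which interpreter is active
python --version&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Because the choice is stored in the  &lt;code class=&quot;language-text&quot;&gt;.python-version&lt;/code&gt;  file, every other folder keeps using the global version.&lt;/p&gt;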
&lt;h2&gt;Formatters, Linters and Beyond 🧼&lt;/h2&gt;
&lt;p&gt;Ensuring code quality is one of the most critical steps in building a professional-grade Python project. Formatters, linters, and type checkers automate this process, helping you maintain consistency, catch bugs, and enforce best practices. In this section, we’ll explore four essential tools to help with this:  &lt;strong&gt;Black&lt;/strong&gt;,  &lt;strong&gt;Flake8&lt;/strong&gt;,  &lt;strong&gt;isort&lt;/strong&gt;, and  &lt;strong&gt;Mypy&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;Black for Code Formatting 🖤&lt;/h3&gt;
&lt;p&gt;Black is an opinionated code formatter that takes care of all the stylistic choices in your code. Instead of wasting time debating code styles or manually reformatting code, Black automatically does that for you! With just a single command, your Python code gets a uniform look, making it easier to read and maintain.&lt;/p&gt;
&lt;p&gt;For example, here’s a before and after comparison of code formatted by Black:&lt;/p&gt;
&lt;p&gt;Before:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;add_numbers&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;a&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;b&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; a&lt;span class=&quot;token operator&quot;&gt;+&lt;/span&gt;b&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;After:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;add_numbers&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;a&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;  
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; a &lt;span class=&quot;token operator&quot;&gt;+&lt;/span&gt; b&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Black’s style is compliant with the PEP 8 guidelines for Python, with one notable default: a line length of 88 characters instead of 79. You can read the guide  &lt;a href=&quot;https://peps.python.org/pep-0008/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;After installing Black with pip, you can run it from the command line on a file or folder:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;black folder_needs_formatting&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can also install it in VS Code and set the editor to apply Black whenever you save a Python file, which is the most convenient method.&lt;/p&gt;
&lt;p&gt;Check this  &lt;a href=&quot;https://marcobelo.medium.com/setting-up-python-black-on-visual-studio-code-5318eba4cd00&quot;&gt;guide&lt;/a&gt;  for instructions.&lt;/p&gt;
&lt;p&gt;Find the Black documentation  &lt;a href=&quot;https://black.readthedocs.io/en/stable/&quot;&gt;here&lt;/a&gt;. Alternatives to Black include  &lt;strong&gt;YAPF&lt;/strong&gt;  (Yet Another Python Formatter) and  &lt;strong&gt;autopep8&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;Linting with Flake8 🔍&lt;/h3&gt;
&lt;p&gt;While Black focuses on formatting,  &lt;strong&gt;Flake8&lt;/strong&gt;  takes care of code quality by detecting common issues such as unused imports, undefined variables, and style violations. It helps you identify potential bugs early, making your code cleaner and more reliable.&lt;/p&gt;
&lt;p&gt;For example, Flake8 might flag the following code:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;calculate_total&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; total  &lt;span class=&quot;token comment&quot;&gt;# undefined variable&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Flake8 reports that  &lt;code class=&quot;language-text&quot;&gt;total&lt;/code&gt;  is an undefined name (error code  &lt;code class=&quot;language-text&quot;&gt;F821&lt;/code&gt;), so you can fix it before it causes a  &lt;code class=&quot;language-text&quot;&gt;NameError&lt;/code&gt;  at runtime.&lt;/p&gt;
&lt;p&gt;It is also advisable to set it up in VS Code. Check this  &lt;a href=&quot;https://dev.to/mingming-ma/python-black-and-flake8-configuration-in-vs-code-as-of-november-3-2023-13ag&quot;&gt;guide&lt;/a&gt;  for instructions.&lt;/p&gt;
&lt;p&gt;Alternatives to Flake8 include  &lt;strong&gt;Pylint&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;Sorting Imports with isort 📦&lt;/h3&gt;
&lt;p&gt;In larger projects, keeping your imports organized is crucial for readability and maintainability. This is where  &lt;strong&gt;isort&lt;/strong&gt;  comes in.  &lt;strong&gt;isort&lt;/strong&gt;  is a tool that automatically sorts your imports, grouping them into logical sections and ensuring that they are in the correct order.&lt;/p&gt;
&lt;p&gt;Before isort:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; os  
&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; requests  
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; django&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;shortcuts &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; render  
&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; sys  
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;models &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; Product  
&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; json&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;After isort:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; json  
&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; os  
&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; sys  
  
&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; requests  
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; django&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;shortcuts &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; render  
  
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;models &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; Product&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With isort, standard library imports, third-party dependencies, and local application imports are neatly separated, following Python’s best practices.&lt;/p&gt;
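&lt;p&gt;If you run isort alongside Black, the two tools can disagree on how to wrap long import lines. A minimal sketch of one way to align them, assuming you keep tool configuration in  &lt;code class=&quot;language-text&quot;&gt;pyproject.toml&lt;/code&gt;  (isort can read other configuration files too), is to select isort’s built-in  &lt;code class=&quot;language-text&quot;&gt;black&lt;/code&gt;  profile:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;toml&quot;&gt;&lt;pre class=&quot;language-toml&quot;&gt;&lt;code class=&quot;language-toml&quot;&gt;# Make isort wrap and group imports in a Black-compatible way
[tool.isort]
profile = &quot;black&quot;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;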
&lt;h3&gt;Type Checking with Mypy 🧠&lt;/h3&gt;
&lt;p&gt;In addition to formatters and linters,  &lt;strong&gt;Mypy&lt;/strong&gt;  adds static type checking to your Python code. Mypy helps you catch type-related bugs before the code runs by checking the types of variables, function arguments, and return values against your type hints.&lt;/p&gt;
&lt;p&gt;For instance, Mypy would catch the following type mismatch:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;add_numbers&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;a&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;  
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; a &lt;span class=&quot;token operator&quot;&gt;+&lt;/span&gt; b  
  
add_numbers&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;1&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;token comment&quot;&gt;# Mypy will flag this!&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;For seamless development, you can also configure Mypy in VS Code.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learn more&lt;/strong&gt;  in the  &lt;a href=&quot;http://mypy-lang.org/&quot;&gt;Mypy documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Introduction to Software Testing with Pytest 🧪&lt;/h2&gt;
&lt;p&gt;How can you be sure that your code does what it’s supposed to — and keeps working even as you add new features or make changes? This is where  &lt;strong&gt;software testing&lt;/strong&gt;  becomes essential. Testing not only confirms that your code works right now, but also gives you the confidence that it will keep working and not break as your project evolves.&lt;/p&gt;
&lt;h3&gt;Writing a Simple Test&lt;/h3&gt;
&lt;p&gt;Suppose you have a function that adds two numbers:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;add_numbers&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;a&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;  
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; a &lt;span class=&quot;token operator&quot;&gt;+&lt;/span&gt; b&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now, let’s write a test for it using Pytest:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;test_add_numbers&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;  
    &lt;span class=&quot;token keyword&quot;&gt;assert&lt;/span&gt; add_numbers&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;5&lt;/span&gt;  
    &lt;span class=&quot;token keyword&quot;&gt;assert&lt;/span&gt; add_numbers&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
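&lt;p&gt;To run the test, just execute  &lt;code class=&quot;language-text&quot;&gt;pytest&lt;/code&gt;  in your terminal. By default, Pytest collects files named  &lt;code class=&quot;language-text&quot;&gt;test_*.py&lt;/code&gt;  or  &lt;code class=&quot;language-text&quot;&gt;*_test.py&lt;/code&gt;  and runs every function in them whose name starts with  &lt;code class=&quot;language-text&quot;&gt;test&lt;/code&gt;.&lt;/p&gt;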
&lt;h3&gt;Beyond Basics: Advanced Testing Topics&lt;/h3&gt;
&lt;p&gt;Once you’re comfortable with basic testing, Pytest offers advanced tools to take your testing to the next level:&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;&lt;strong&gt;Test Coverage&lt;/strong&gt;: Ensure that all parts of your code are being tested by measuring  &lt;strong&gt;test coverage&lt;/strong&gt;. Tools like  &lt;code class=&quot;language-text&quot;&gt;pytest-cov&lt;/code&gt;  help you identify untested parts of your project.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parameterized Tests&lt;/strong&gt;: Run the same test with multiple inputs to catch edge cases without repeating code.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fixtures&lt;/strong&gt;: Simplify complex test setups by using  &lt;strong&gt;fixtures&lt;/strong&gt;  to manage dependencies, like database connections or file structures.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These tools can make your tests more efficient and thorough. For more on these advanced features, check out the  &lt;a href=&quot;https://docs.pytest.org/en/stable/&quot;&gt;Pytest documentation&lt;/a&gt;.&lt;/p&gt;
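&lt;p&gt;To make the parameterized tests and fixtures described above more concrete, here is a minimal sketch. It assumes Pytest is installed and that the code is saved in a file such as  &lt;code class=&quot;language-text&quot;&gt;test_add_numbers.py&lt;/code&gt;  (a hypothetical name); the  &lt;code class=&quot;language-text&quot;&gt;sample_numbers&lt;/code&gt;  fixture is also invented for illustration.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;import pytest


def add_numbers(a, b):
    return a + b


# One test function, several input/output cases
@pytest.mark.parametrize(
    (&quot;a&quot;, &quot;b&quot;, &quot;expected&quot;),
    [(2, 3, 5), (-1, 1, 0), (0, 0, 0)],
)
def test_add_numbers(a, b, expected):
    assert add_numbers(a, b) == expected


# A fixture prepares shared setup; Pytest injects it by argument name
@pytest.fixture
def sample_numbers():
    return [1, 2, 3]


def test_sum_with_fixture(sample_numbers):
    assert add_numbers(sample_numbers[0], sample_numbers[1]) == 3&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Each tuple in the parameter list is reported as its own test case, so a failure points directly at the input that caused it.&lt;/p&gt;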
&lt;h2&gt;The Power of Makefiles: Automating Your Workflow ⚙️&lt;/h2&gt;
&lt;p&gt;As your Python projects grow, you’ll notice a pattern: running the same commands repeatedly, whether it’s for testing, linting, formatting, or even just launching your application. Manually typing out these commands each time can become tedious.&lt;/p&gt;
&lt;p&gt;Makefiles allow you to define a series of commands in a file (&lt;code class=&quot;language-text&quot;&gt;Makefile&lt;/code&gt;), which can then be executed with a single, memorable command:  &lt;code class=&quot;language-text&quot;&gt;make&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;The Structure of a Makefile&lt;/h3&gt;
&lt;p&gt;A Makefile consists of  &lt;strong&gt;rules&lt;/strong&gt;, which are written in the format:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;makefile&quot;&gt;&lt;pre class=&quot;language-makefile&quot;&gt;&lt;code class=&quot;language-makefile&quot;&gt;&lt;span class=&quot;token target symbol&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; dependencies
	command&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;&lt;strong&gt;Target&lt;/strong&gt;: This is the name of the task you want to run. It can be anything you choose, like  &lt;code class=&quot;language-text&quot;&gt;format&lt;/code&gt;,  &lt;code class=&quot;language-text&quot;&gt;test&lt;/code&gt;, or  &lt;code class=&quot;language-text&quot;&gt;build&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dependencies&lt;/strong&gt;: These are files or targets that must be up-to-date before the current target runs. While they are more commonly used in software compilation, in Python projects, we don’t usually use them unless specific files must be checked before a command runs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Command&lt;/strong&gt;: This is the shell command to execute when the target is called. Commands must be indented with a  &lt;strong&gt;tab&lt;/strong&gt;, which is a common source of errors when writing Makefiles.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;A Makefile Through an Example&lt;/h3&gt;
&lt;p&gt;Let’s walk through an example. Suppose your project frequently requires the following tasks:&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Formatting your code with  &lt;strong&gt;Black&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Linting your code with  &lt;strong&gt;Flake8&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Running tests with  &lt;strong&gt;Pytest&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Create a file named&lt;/strong&gt; &lt;code class=&quot;language-text&quot;&gt;Makefile&lt;/code&gt;  in the root directory of your project. It should have no extension.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;makefile&quot;&gt;&lt;pre class=&quot;language-makefile&quot;&gt;&lt;code class=&quot;language-makefile&quot;&gt;# These targets are tasks, not files, so always run them
.PHONY: all format lint test

&lt;span class=&quot;token target symbol&quot;&gt;all&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; format lint test

&lt;span class=&quot;token target symbol&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
	black .

&lt;span class=&quot;token target symbol&quot;&gt;lint&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
	flake8 .

&lt;span class=&quot;token target symbol&quot;&gt;test&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
	pytest&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here, the  &lt;code class=&quot;language-text&quot;&gt;all&lt;/code&gt;  target runs  &lt;code class=&quot;language-text&quot;&gt;format&lt;/code&gt;,  &lt;code class=&quot;language-text&quot;&gt;lint&lt;/code&gt;, and  &lt;code class=&quot;language-text&quot;&gt;test&lt;/code&gt;  in that order. When you type  &lt;code class=&quot;language-text&quot;&gt;make all&lt;/code&gt;, all three tasks are executed. The  &lt;code class=&quot;language-text&quot;&gt;.PHONY&lt;/code&gt;  line tells  &lt;code class=&quot;language-text&quot;&gt;make&lt;/code&gt;  that these names are tasks rather than files, so a file or folder named  &lt;code class=&quot;language-text&quot;&gt;test&lt;/code&gt;  cannot cause  &lt;code class=&quot;language-text&quot;&gt;make test&lt;/code&gt;  to be skipped. Remember that each command line must start with a tab, not spaces.&lt;/p&gt;
&lt;p&gt;For a more in-depth guide, check this  &lt;a href=&quot;https://medium.com/aigent/makefiles-for-python-and-beyond-5cf28349bf05&quot;&gt;article&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;CI/CD: Automate Testing, Formatting, and Code Quality 🚀&lt;/h2&gt;
&lt;p&gt;With your code formatted, tested, and linted, how can you ensure that every change is consistently checked before merging into your project? That’s where  &lt;strong&gt;Continuous Integration (CI)&lt;/strong&gt;  and  &lt;strong&gt;Continuous Deployment (CD)&lt;/strong&gt;  come in.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Continuous Integration (CI)&lt;/strong&gt;: Every time you or your team pushes new code, CI automatically runs your tests, linting, and formatting checks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Continuous Deployment (CD)&lt;/strong&gt;: Once your code passes all the CI checks, CD takes over by deploying it automatically to your production or staging environment.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;CI/CD&lt;/strong&gt;  ensures that every code change is verified in the same way before it is merged. This catches bugs early and keeps your project clean.&lt;/p&gt;
&lt;h3&gt;Example: CI Pipeline with GitHub Actions 🛠️&lt;/h3&gt;
&lt;p&gt;Create a  &lt;code class=&quot;language-text&quot;&gt;.github/workflows/ci.yml&lt;/code&gt;  file in your project and add the following configuration:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; CI Pipeline  
  
&lt;span class=&quot;token key atrule&quot;&gt;on&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;  
  &lt;span class=&quot;token key atrule&quot;&gt;push&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;  
    &lt;span class=&quot;token key atrule&quot;&gt;branches&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;  
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; main  
  
&lt;span class=&quot;token key atrule&quot;&gt;jobs&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;  
  &lt;span class=&quot;token key atrule&quot;&gt;test&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;  
    &lt;span class=&quot;token key atrule&quot;&gt;runs-on&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; ubuntu&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;latest  
    &lt;span class=&quot;token key atrule&quot;&gt;steps&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;  
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;uses&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; actions/checkout@v4
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;uses&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; actions/setup&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;python@v5
        &lt;span class=&quot;token key atrule&quot;&gt;with&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;  
          &lt;span class=&quot;token key atrule&quot;&gt;python-version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;3.x&apos;&lt;/span&gt;  
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; pip install poetry &lt;span class=&quot;token important&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; poetry install  
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; poetry run black &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;check .  
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; poetry run flake8  
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; poetry run pytest&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This pipeline runs  &lt;strong&gt;Black&lt;/strong&gt;  (in check-only mode, so it reports unformatted files without changing them),  &lt;strong&gt;Flake8&lt;/strong&gt;, and  &lt;strong&gt;Pytest&lt;/strong&gt;  on each push to  &lt;code class=&quot;language-text&quot;&gt;main&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For more, check out  &lt;a href=&quot;https://docs.github.com/en/actions&quot;&gt;GitHub Actions docs&lt;/a&gt;.&lt;/p&gt;
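&lt;p&gt;The workflow above only runs once a change has already landed on  &lt;code class=&quot;language-text&quot;&gt;main&lt;/code&gt;. As a hedged sketch, if you also want the checks to run on pull requests before they are merged, you could replace the  &lt;code class=&quot;language-text&quot;&gt;on:&lt;/code&gt;  block with the following. It uses the standard  &lt;code class=&quot;language-text&quot;&gt;pull_request&lt;/code&gt;  event and assumes  &lt;code class=&quot;language-text&quot;&gt;main&lt;/code&gt;  is the branch you merge into:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;on:
  # Run after changes land on main
  push:
    branches:
      - main
  # Also run on pull requests that target main, before they are merged
  pull_request:
    branches:
      - main&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;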
&lt;h2&gt;Refactoring and Clean Code Practices: Beyond Automation 🧹&lt;/h2&gt;
&lt;p&gt;While tools like  &lt;strong&gt;Black&lt;/strong&gt;  and  &lt;strong&gt;Flake8&lt;/strong&gt;  help you automate formatting and linting, automation can only take you so far. Clean, maintainable code isn’t just about fixing style issues; it’s about writing code that humans can understand and improve over time.&lt;/p&gt;
&lt;h3&gt;Refactoring in Action&lt;/h3&gt;
&lt;p&gt;Let’s say you have a function that works but could be cleaner:&lt;/p&gt;
&lt;p&gt;Before:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;process_data&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;data&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;  
    result &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;  
    &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; item &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; data&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;  
        &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; item&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;age&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;18&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;  
            result&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;append&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;item&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;name&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;upper&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; result&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;After:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;ADULT_AGE &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;18&lt;/span&gt;  
  
&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;is_adult&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;person&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;  
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; person&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;age&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; ADULT_AGE  
  
&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;get_name_uppercase&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;person&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;  
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; person&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;name&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;upper&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  
  
&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;process_data&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;data&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;  
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;get_name_uppercase&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;person&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; person &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; data &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; is_adult&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;person&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The code is now split into small, meaningful functions with clear names.&lt;/p&gt;
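&lt;p&gt;As a quick sanity check, the refactored functions can be run on a small sample. This is a minimal sketch: the &lt;code class=&quot;language-text&quot;&gt;people&lt;/code&gt; list is made-up data for illustration, not part of the original example.&lt;/p&gt;

```python
ADULT_AGE = 18

def is_adult(person):
    # Strict comparison, same as the original "age > 18" check
    return person['age'] > ADULT_AGE

def get_name_uppercase(person):
    return person['name'].upper()

def process_data(data):
    return [get_name_uppercase(person) for person in data if is_adult(person)]

# Hypothetical sample records, invented for this illustration
people = [
    {'name': 'Ada', 'age': 36},
    {'name': 'Tim', 'age': 12},
    {'name': 'Lin', 'age': 18},
]

print(process_data(people))  # ['ADA']
```

&lt;p&gt;Note that someone who is exactly 18 is filtered out, because the comparison is strict. With the threshold isolated in &lt;code class=&quot;language-text&quot;&gt;is_adult&lt;/code&gt;, changing that rule means touching one line only.&lt;/p&gt;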
&lt;p&gt;For more tips on refactoring, check out this  &lt;a href=&quot;https://refactoring.guru/&quot;&gt;refactoring guide&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Here are some key clean code practices:&lt;/h3&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;&lt;strong&gt;Keep Functions Small&lt;/strong&gt;: Break your code into bite-sized, single-purpose functions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Descriptive Names&lt;/strong&gt;: Good names make code self-explanatory, reducing the need for comments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Avoid Repetition&lt;/strong&gt;: Stick to the &lt;strong&gt;DRY&lt;/strong&gt; (Don’t Repeat Yourself) principle: refactor duplicate code into reusable functions.&lt;/li&gt;
&lt;/ul&gt;
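&lt;p&gt;To make the DRY point concrete, here is a small sketch. The function names (&lt;code class=&quot;language-text&quot;&gt;format_price&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;invoice_line&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;receipt_total&lt;/code&gt;) are hypothetical and chosen only for illustration: the formatting rule lives in one helper instead of being copied into every caller.&lt;/p&gt;

```python
def format_price(amount, currency="USD"):
    # Single place that decides how a price is displayed (hypothetical helper)
    return f"{amount:.2f} {currency}"

def invoice_line(item, amount):
    # Reuses the helper instead of repeating the format string
    return f"{item}: {format_price(amount)}"

def receipt_total(amounts):
    # Same helper, so totals and line items always look consistent
    return f"Total: {format_price(sum(amounts))}"

print(invoice_line("Book", 12.5))   # Book: 12.50 USD
print(receipt_total([12.5, 7.25]))  # Total: 19.75 USD
```

&lt;p&gt;If the display format ever changes, only &lt;code class=&quot;language-text&quot;&gt;format_price&lt;/code&gt; needs to be edited.&lt;/p&gt;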
&lt;p&gt;When you combine good refactoring with clean code principles, your projects become easier to maintain and scale. To dive deeper, explore this  &lt;a href=&quot;https://clean-code-developer.com/&quot;&gt;&lt;strong&gt;guide to writing clean code&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;To sum up, by adopting these tools and practices, you can make your Python projects clean, maintainable, and professional-grade. Whether it’s managing dependencies with Poetry or automating tests with CI/CD, each step saves you time and headaches in the long run!&lt;/p&gt;
&lt;p&gt;Thanks for reading, and I hope this guide helps you on your journey to building better Python projects! Feel free to reach out on  &lt;a href=&quot;https://www.linkedin.com/in/tariqmassaoudi/&quot;&gt;LinkedIn&lt;/a&gt;  if you have any questions or want to chat more.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Unraveling the Mysteries of the Mind: A Journey Through 20 Psychological Principles]]></title><description><![CDATA[Introduction: Imagine you’ve just binge-watched an enthralling new TV show. The characters, the plot twists, the dialogues — they’re all…]]></description><link>https://www.tariqmassaoudi.com/psychology-principles/</link><guid isPermaLink="false">https://www.tariqmassaoudi.com/psychology-principles/</guid><pubDate>Mon, 04 Dec 2023 22:40:32 GMT</pubDate><content:encoded>&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;https://miro.medium.com/v2/resize:fit:875/0*IA_uIzZvk5ZbcM61&quot; alt=&quot;Endowment Effect&quot; style=&quot;display: block; margin: 0 auto;&quot;&gt;
&lt;/div&gt;
&lt;h1&gt;&lt;strong&gt;Introduction:&lt;/strong&gt;&lt;/h1&gt;
&lt;p&gt;Imagine you’ve just binge-watched an enthralling new TV show. The characters, the plot twists, the dialogues — they’re all fresh in your mind. Then, as if by some twist of fate, you start noticing references to this show everywhere — in conversations, on social media, even in casual remarks from your colleagues. Is this mere chance, or is there something more to this pattern? Dive with us into the fascinating realm of psychological principles and uncover how they subtly influence our perceptions and daily experiences.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;1. The Baader-Meinhof Phenomenon: The Illusion of Frequency&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Ever mentioned a quirky, seemingly rare vintage car and then spotted it everywhere? That’s the Baader-Meinhof Phenomenon in action. It’s like our brain, the ultimate pattern-recognition machine, suddenly puts a spotlight on what was always there. It’s a quirky reminder of how our perception can paint a skewed picture of reality. Next time this happens, take a beat to think: where else might my brain be playing this trick on me?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Practical Application&lt;/strong&gt;: When you notice this phenomenon, pause and consider other areas in life where your perception might be creating a false narrative of frequency or importance.&lt;/p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;https://miro.medium.com/v2/resize:fit:875/1*t_sGI9MtXObaylkvCDxHLg.png&quot; alt=&quot;Baader-Meinhof Phenomenon&quot; style=&quot;display: block; margin: 0 auto;&quot;&gt;
&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;2. The Dunning-Kruger Effect: The Peak of Mt. Stupid&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Remember when you first tried cooking a complex dish and thought, ‘Hey, I’m pretty good at this’? Only to realize later that your masterpiece barely scratched the surface? Welcome to the Dunning-Kruger Effect — a humbling journey from the ‘peak of Mt. Stupid’ to the valleys of ‘I have so much to learn.’ It’s a nudge to keep learning, to never stop evolving.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Practical Application&lt;/strong&gt;: Recognize when you might be on the “peak of Mt. Stupid” and actively seek feedback and knowledge to climb towards true expertise.&lt;/p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;https://miro.medium.com/v2/resize:fit:875/0*IhE8iXir8OcYokTw.png&quot; alt=&quot;Dunning-Kruger Effect&quot; style=&quot;display: block; margin: 0 auto;&quot;&gt;
&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;3. The Peter Principle: Rising to the Level of Incompetence&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Consider Alex, a top-performing sales associate in a retail company. His exceptional sales record led to a promotion to sales manager. However, managing a team, unlike closing sales deals, wasn’t his forte. Alex’s struggle in his new role is a textbook example of the Peter Principle: excelling in one position doesn’t guarantee competence in a higher role.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Practical Application&lt;/strong&gt;: Assess your own career path. Are you equipped for your current role, or is there a skill gap you need to address?&lt;/p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;https://miro.medium.com/v2/resize:fit:875/0*LTWvVzVCWjhw65M9.jpg&quot; alt=&quot;Peter Principle&quot; style=&quot;display: block; margin: 0 auto;&quot;&gt;
&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;4. Anchoring Effect: The First Number Sticks&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;In negotiations, the first number thrown out often becomes an invisible anchor, influencing all that follows. Think about the last time you haggled for a car or negotiated your salary. The initial figure sets the stage, impacting the entire negotiation dance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Practical Application&lt;/strong&gt;: Be mindful of initial figures in negotiations — whether you’re buying a car or discussing a raise. Set your anchors wisely!&lt;/p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;https://miro.medium.com/v2/resize:fit:875/0*Au7sT8-LFhI_80lu&quot; alt=&quot;Anchoring Effect&quot; style=&quot;display: block; margin: 0 auto;&quot;&gt;
&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;5. The Cobra Effect: Good Intentions, Unintended Consequences&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;When the British government in colonial India offered a bounty for cobras, it led to people breeding cobras instead of reducing their population. This is the Cobra Effect, where solutions can sometimes create more problems.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Practical Application&lt;/strong&gt;: Think through the potential unintended consequences before implementing a solution. Look for a holistic understanding rather than quick fixes.&lt;/p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;https://miro.medium.com/v2/resize:fit:874/0*cgppwkdi4UDy649C&quot; alt=&quot;Cobra Effect&quot; style=&quot;display: block; margin: 0 auto;&quot;&gt;
&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;6. Amara’s Law: Misjudging Technology’s Impact&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Consider the rise of social media platforms like Facebook or Instagram. Initially, many viewed them as simple online spaces for sharing photos and catching up with friends. However, over time, their long-term impact has been profound, reshaping how we communicate, influencing global politics, and even affecting mental health. This illustrates Amara’s Law: in our tech-driven world, we often overestimate the short-term effects of new technologies while vastly underestimating their long-term implications. This principle is particularly important for businesses and individuals trying to navigate the ever-evolving landscape of the digital age.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Practical Application&lt;/strong&gt;: Balance your expectations when evaluating new technology. Consider long-term implications, not just immediate benefits.&lt;/p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;https://miro.medium.com/v2/resize:fit:875/0*M0Y0CJVIubi5LypD.png&quot; alt=&quot;Amara&apos;s Law&quot; style=&quot;display: block; margin: 0 auto;&quot;&gt;
&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;7. The Law of Least Effort: Path of Minimum Resistance&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Consider the popularity of ride-sharing apps like Uber or Lyft. These services exemplify the Law of Least Effort by offering a more convenient alternative to traditional taxis or public transport. People often choose these apps for their ease of use and accessibility.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Practical Application:&lt;/strong&gt;  When designing products, services, or even your daily routine, aim for simplicity and ease to encourage usage and adherence.&lt;/p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;https://miro.medium.com/v2/resize:fit:875/0*tJPVNZ6DKsDkKxHQ&quot; alt=&quot;Law of Least Effort&quot; style=&quot;display: block; margin: 0 auto;&quot;&gt;
&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;8. Brooks’s Law: More Is Not Always Better&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;In project management, picture a software development team racing against a tight deadline. In a last-minute bid to speed things up, additional programmers are brought in. Instead of accelerating progress, the project stalls further as the new team members require training and orientation. This scenario is a classic example of Brooks’s Law, which posits that adding manpower to a late project only makes it later.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Practical Application:&lt;/strong&gt;  In managing projects, consider the integration and training time new members require. Sometimes, more is not better.&lt;/p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;https://miro.medium.com/v2/resize:fit:875/1*thk6oY4JGjgypCY2vepJhw.png&quot; alt=&quot;Brooks&apos;s Law&quot; style=&quot;display: block; margin: 0 auto;&quot;&gt;
&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;9. The Law of Triviality (Bike Shedding): The Focus on the Inconsequential&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Also known as “Bike Shedding,” this law describes how people spend disproportionate time on trivial issues. It’s a common occurrence in meetings where minor details consume hours while major issues get minimal attention.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Practical Application:&lt;/strong&gt;  Next time you’re in a meeting, play the role of the focus-shifter. Watch how discussions veer towards the inconsequential and gently steer them back to the matters that truly impact the bottom line. Remember, the color of the bike shed might be interesting, but it’s the structural integrity of the building that matters most.&lt;/p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;https://miro.medium.com/v2/resize:fit:619/0*chZMvKJ4G_mFWQzY.png&quot; alt=&quot;Law of Triviality&quot; style=&quot;display: block; margin: 0 auto;&quot;&gt;
&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;10. The Contrast Principle: Relative Perception&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Our perceptions are heavily influenced by comparisons, as illustrated by the Contrast Principle. A moderately priced meal seems affordable next to an expensive one, and a warm day feels hot following a cold spell.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Practical Application:&lt;/strong&gt;  Be aware of how contrast might be affecting your judgments. When making decisions, try to assess options on their own merits, not just in comparison to others.&lt;/p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;https://miro.medium.com/v2/resize:fit:875/0*0yZSHnQ5_3-R1aMn.jpg&quot; alt=&quot;Contrast Principle&quot; style=&quot;display: block; margin: 0 auto;&quot;&gt;
&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;11. The Endowment Effect: Overvaluing What We Own&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Ever wondered why it’s so hard to part with that old guitar gathering dust in the corner, even though you haven’t strummed it in years? Welcome to the Endowment Effect, where everything we own, from musical instruments to quirky collectibles, magically gains an inflated value in our eyes. It’s the reason why garage sales are battles of wills, and why decluttering feels like parting with pieces of our soul.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Practical Application&lt;/strong&gt;: Next time you hesitate to donate or sell something, ask yourself: “Am I valuing this because of its use, or just because it’s mine?”&lt;/p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;https://miro.medium.com/v2/resize:fit:625/0*6ig9ilMXHKg69KFL&quot; alt=&quot;Endowment Effect&quot; style=&quot;display: block; margin: 0 auto;&quot;&gt;
&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;12. The Serial Position Effect: Remembering the First and Last&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;In lists or presentations, the first and last items are typically remembered best. This is known as the Serial Position Effect, encompassing the primacy and recency effects.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Practical Application:&lt;/strong&gt;  When delivering information, place the most important points at the beginning or end. This can be particularly effective in presentations or teaching.&lt;/p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;https://miro.medium.com/v2/resize:fit:875/0*v6ncqqVGVlWD9i6U.jpg&quot; alt=&quot;Serial Position Effect&quot; style=&quot;display: block; margin: 0 auto;&quot;&gt;
&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;13. The Spotlight Effect: We’re Not as Noticed as We Think&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;The Spotlight Effect is the tendency to overestimate how much others notice our appearance or behavior. It’s that feeling when you trip in public and think everyone saw.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Practical Application&lt;/strong&gt;: Remember that everyone is more concerned with themselves than with you. This can be liberating in social situations or public speaking.&lt;/p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;https://miro.medium.com/v2/resize:fit:875/0*pE9JZJmXEEwZ7GVZ&quot; alt=&quot;Spotlight Effect&quot; style=&quot;display: block; margin: 0 auto;&quot;&gt;
&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;14. The Foot-in-the-Door Technique: Small Commitments Lead to Larger Ones&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;This technique involves getting someone to agree to a small request as a precursor to a larger one. It’s a common principle in sales and persuasion.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Practical Application&lt;/strong&gt;: Start with small requests to build up to larger ones, whether in fundraising, selling, or persuasion.&lt;/p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;https://miro.medium.com/v2/resize:fit:719/0*xYw0NW3oAdl9BRHI.jpg&quot; alt=&quot;Foot-in-the-Door Technique&quot; style=&quot;display: block; margin: 0 auto;&quot;&gt;
&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;15. The Ben Franklin Effect&lt;/strong&gt;: &lt;strong&gt;Seeking Consistency in Behavior&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;The Ben Franklin Effect suggests that when someone does you a favor, they’re more likely to do you another, as people seek consistency in their behavior.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Practical Application:&lt;/strong&gt; Don’t hesitate to ask for small favors. It can be a starting point for building stronger relationships.&lt;/p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;https://miro.medium.com/v2/resize:fit:759/0*mc8sRgYUKlv1P9CU.png&quot; alt=&quot;Ben Franklin Effect&quot; style=&quot;display: block; margin: 0 auto;&quot;&gt;
&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;16. The Pygmalion Effect:&lt;/strong&gt;  &lt;strong&gt;The Power of Expectations&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Consider a manager who believes strongly in a team member’s abilities. That belief, communicated through expectations and support, often results in the employee reaching new heights in their career. This exemplifies the Pygmalion Effect.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Practical Application:&lt;/strong&gt;  Set high expectations for those around you — employees, students, even family members — and provide them with the support to meet these expectations.&lt;/p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;https://miro.medium.com/v2/resize:fit:750/0*eyHmVgol0DiUI7hw.gif&quot; alt=&quot;Pygmalion Effect&quot; style=&quot;display: block; margin: 0 auto;&quot;&gt;
&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;17. The IKEA Effect:&lt;/strong&gt;  &lt;strong&gt;Valuing Our Own Labor&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Consider a meal you’ve cooked from scratch, laboring over each ingredient. Somehow, it always tastes better than a store-bought dish, right? This isn’t just culinary skills at play; it’s the IKEA Effect. The effort we put into creating something, be it food, furniture, or art, endows it with extra value in our eyes. It’s a blend of pride, effort, and, yes, a little bit of love.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Practical Application&lt;/strong&gt;: Think of something you’ve built or created recently. How does the effort you put into it change how you feel about the final product?&lt;/p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;https://miro.medium.com/v2/resize:fit:875/0*OyS02KFPac5urJvU&quot; alt=&quot;IKEA Effect&quot; style=&quot;display: block; margin: 0 auto;&quot;&gt;
&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;18. Equity Theory:&lt;/strong&gt;  &lt;strong&gt;Balancing Input and Output&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Picture yourself at work, putting in extra hours, crafting perfect presentations, only to receive the same recognition as your colleague who seems to do the bare minimum. Frustrating, isn’t it? This is Equity Theory in action. It explains why we feel disheartened when our hard work doesn’t seem to pay off as it should. It’s about the balance, or imbalance, of what we put into our jobs (input) versus what we get out of them (output).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Practical Application:&lt;/strong&gt;  Strive for fairness in your interactions. Recognize the efforts of others and ensure they feel valued.&lt;/p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;https://miro.medium.com/v2/resize:fit:875/1*tzmqaSoeXQuuxdlHqi6Ruw.png&quot; alt=&quot;Equity Theory&quot; style=&quot;display: block; margin: 0 auto;&quot;&gt;
&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;19. Hick’s Law: The Paradox of Choice&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Ever found yourself overwhelmed in the supermarket, staring blankly at the dozens of options? That’s Hick’s Law in real life: the more choices we have, whether it’s cereals, cars, or clothes, the longer it takes to make a decision. This overload, often called the paradox of choice, can lead to decision fatigue, making even the simplest choices feel daunting.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Practical Application&lt;/strong&gt;: Simplify choices to make decision-making easier, whether in business or personal life.&lt;/p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;https://miro.medium.com/v2/resize:fit:875/0*6VgY7yqIO94cbIWR.png&quot; alt=&quot;Hick&apos;s Law&quot; style=&quot;display: block; margin: 0 auto;&quot;&gt;
&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;20. Parkinson’s Law:&lt;/strong&gt;  &lt;strong&gt;Work Expands to Fill Time&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Tasks often expand to fill the time allotted for them, a phenomenon known as Parkinson’s Law. If you give yourself a week to complete a two-hour task, it will take a week.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Practical Application&lt;/strong&gt;: Set realistic deadlines to improve efficiency. Use this principle to manage time and avoid procrastination.&lt;/p&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;https://miro.medium.com/v2/resize:fit:875/0*_iL-7uTaHEzvYcUp.jpg&quot; alt=&quot;Parkinson&apos;s Law&quot; style=&quot;display: block; margin: 0 auto;&quot;&gt;
&lt;/div&gt;
&lt;h1&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;/h1&gt;
&lt;p&gt;And there you have it — a whirlwind tour through the labyrinth of our minds. These principles aren’t just textbook concepts; they’re alive in every decision we make, every relationship we hold, and every goal we chase. Understanding them is like having a roadmap to the human psyche, helping us navigate life with a bit more wisdom and a lot more awareness. So, what’s the next principle you’ll spot in your daily life?&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Data Science Pro-Tips: 5 Python Tricks You Must Know]]></title><description><![CDATA[As a data scientist, Python is the go-to tool. Its versatility, with a large ecosystem of libraries and rich data manipulation capabilities…]]></description><link>https://www.tariqmassaoudi.com/data-science-pro-tips/</link><guid isPermaLink="false">https://www.tariqmassaoudi.com/data-science-pro-tips/</guid><pubDate>Thu, 13 Apr 2023 22:12:03 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:828/format:webp/1*Vq1hXFmMiI-HbPOHciJPyQ.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;As a data scientist, Python is the go-to tool. Its versatility, with a large ecosystem of libraries and rich data manipulation capabilities, makes it a preferred language for data analysis and machine learning. But, are you fully leveraging Python’s potential to optimize your data science workflows?&lt;/p&gt;
&lt;p&gt;In this article, I will share with you some of the most practical tips and tricks for data science using Python. Whether you are a beginner looking to level up your Python skills or an experienced data scientist seeking to enhance your productivity, these tips will help you unlock new possibilities in your data science projects.&lt;/p&gt;
&lt;h1&gt;Never loop over a DataFrame! Use .apply() instead.&lt;/h1&gt;
&lt;p&gt;To perform any kind of data transformation, you will eventually need to loop over every row, perform some computation, and return the transformed column.&lt;/p&gt;
&lt;p&gt;A common mistake is to iterate over the rows with Python’s built-in &lt;code class=&quot;language-text&quot;&gt;for&lt;/code&gt; loop. Avoid doing that, as it can be very slow on large DataFrames. A better way is to use the &lt;code class=&quot;language-text&quot;&gt;apply&lt;/code&gt; function in Pandas, combined with a lambda function if your transformation logic is simple, or with a separate function that you define if the logic is complex.&lt;/p&gt;
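&lt;p&gt;To illustrate the difference, here is a minimal sketch that computes the same column twice, once with a row loop and once with &lt;code class=&quot;language-text&quot;&gt;apply&lt;/code&gt; and a lambda. The tiny DataFrame, the &lt;code class=&quot;language-text&quot;&gt;Fare&lt;/code&gt; values, and the 10% markup are assumptions made up for this illustration.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset
df = pd.DataFrame({"Fare": [7.25, 71.28, 8.05]})

# Pattern to avoid: an explicit Python loop over the rows
looped = []
for _, row in df.iterrows():
    looped.append(round(row["Fare"] * 1.1, 2))
df["Fare_loop"] = looped

# Same transformation expressed with apply and a lambda
df["Fare_apply"] = df["Fare"].apply(lambda fare: round(fare * 1.1, 2))

# Both columns hold the same values
print(df["Fare_loop"].tolist() == df["Fare_apply"].tolist())  # True
```

&lt;p&gt;For plain arithmetic like this, Pandas can also operate on the whole column at once (for example &lt;code class=&quot;language-text&quot;&gt;df[&quot;Fare&quot;] * 1.1&lt;/code&gt;), so &lt;code class=&quot;language-text&quot;&gt;apply&lt;/code&gt; is most useful when the logic does not reduce to a simple column expression.&lt;/p&gt;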
&lt;p&gt;Here’s an overview of the  &lt;code class=&quot;language-text&quot;&gt;apply&lt;/code&gt;  function with an example using the Titanic dataset:&lt;/p&gt;
&lt;p&gt;&lt;div id=&quot;gist121904951&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot; translate=&quot;no&quot; data-color-mode=&quot;light&quot; data-light-theme=&quot;light&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        
&lt;div class=&quot;js-gist-file-update-container js-task-list-container&quot;&gt;
      &lt;div id=&quot;file-applytitanic-py&quot; class=&quot;file my-2&quot;&gt;
    
    &lt;div itemprop=&quot;text&quot;
      class=&quot;Box-body p-0 blob-wrapper data type-python  &quot;
      style=&quot;overflow: auto&quot; tabindex=&quot;0&quot; role=&quot;region&quot;
      aria-label=&quot;applyTitanic.py content, created by tariqmassaoudi on 12:06PM on April 13, 2023.&quot;
    &gt;

        
&lt;div class=&quot;js-check-hidden-unicode js-blob-code-container blob-code-content&quot;&gt;


  &lt;table data-hpc class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;4&quot; data-paste-markdown-skip data-tagsearch-path=&quot;applyTitanic.py&quot;&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-applytitanic-py-L1&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-applytitanic-py-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-applytitanic-py-L2&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-applytitanic-py-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;# Defining a custom function to categorize age groups&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-applytitanic-py-L3&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;3&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-applytitanic-py-LC3&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;def categorize_age(age):&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-applytitanic-py-L4&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;4&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-applytitanic-py-LC4&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    if age &amp;lt; 18:&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-applytitanic-py-L5&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;5&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-applytitanic-py-LC5&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;        return &amp;#39;Child&amp;#39;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-applytitanic-py-L6&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;6&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-applytitanic-py-LC6&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    elif age &amp;gt;= 18 and age &amp;lt; 30:&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-applytitanic-py-L7&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;7&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-applytitanic-py-LC7&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;        return &amp;#39;Young Adult&amp;#39;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-applytitanic-py-L8&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;8&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-applytitanic-py-LC8&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    elif age &amp;gt;= 30 and age &amp;lt; 50:&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-applytitanic-py-L9&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;9&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-applytitanic-py-LC9&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;        return &amp;#39;Adult&amp;#39;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-applytitanic-py-L10&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;10&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-applytitanic-py-LC10&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    else:&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-applytitanic-py-L11&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;11&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-applytitanic-py-LC11&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;        return &amp;#39;Senior&amp;#39;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-applytitanic-py-L12&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;12&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-applytitanic-py-LC12&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-applytitanic-py-L13&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;13&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-applytitanic-py-LC13&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;# Use .apply() to apply the custom function to the &amp;quot;Age&amp;quot; column&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-applytitanic-py-L14&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;14&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-applytitanic-py-LC14&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;titanic_df[&amp;#39;Age_Category&amp;#39;] = titanic_df[&amp;#39;Age&amp;#39;].apply(categorize_age)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-applytitanic-py-L15&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;15&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-applytitanic-py-LC15&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-applytitanic-py-L16&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;16&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-applytitanic-py-LC16&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;# Print the updated dataframe&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-applytitanic-py-L17&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;17&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-applytitanic-py-LC17&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;titanic_df[[&amp;#39;Age&amp;#39;, &amp;#39;Age_Category&amp;#39;]].head(5)&lt;/td&gt;
        &lt;/tr&gt;
  &lt;/table&gt;
&lt;/div&gt;


    &lt;/div&gt;

  &lt;/div&gt;

&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/tariqmassaoudi/1a30a2a35d35ad429fac8056f5759d01/raw/e82ca986700aa6d6800c621adcdf225b857a91ad/applyTitanic.py&quot; style=&quot;float:right&quot; class=&quot;Link--inTextBlock&quot;&gt;view raw&lt;/a&gt;
        &lt;a href=&quot;https://gist.github.com/tariqmassaoudi/1a30a2a35d35ad429fac8056f5759d01#file-applytitanic-py&quot; class=&quot;Link--inTextBlock&quot;&gt;
          applyTitanic.py
        &lt;/a&gt;
        hosted with &amp;#10084; by &lt;a class=&quot;Link--inTextBlock&quot; href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:258/1*28KTTJxPkzg2MEnCneyj1A.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Note that you can also use &lt;code class=&quot;language-text&quot;&gt;apply&lt;/code&gt; to combine multiple columns of the DataFrame, but you need to pass &lt;code class=&quot;language-text&quot;&gt;axis=1&lt;/code&gt; so that the function receives one row at a time. Here’s an example using a lambda function that combines two columns, &lt;code class=&quot;language-text&quot;&gt;price_1&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;price_2&lt;/code&gt;, to create a new column, &lt;code class=&quot;language-text&quot;&gt;tot_price&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;df[&quot;tot_price&quot;] = df.apply(lambda row: row[&quot;price_1&quot;] + row[&quot;price_2&quot;], axis=1)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
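&lt;p&gt;As a minimal sketch of the row-wise pattern, the snippet below builds a small made-up DataFrame (the values are illustrative assumptions, not data from this article) and computes the total in two ways: with &lt;code class=&quot;language-text&quot;&gt;apply(axis=1)&lt;/code&gt; and with plain column addition. For simple arithmetic like this, the vectorized form is usually faster and should give the same result.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({"price_1": [10.0, 20.0, 30.0], "price_2": [1.5, 2.5, 3.5]})

# Row-wise: axis=1 passes each row to the lambda as a Series
df["tot_price"] = df.apply(lambda row: row["price_1"] + row["price_2"], axis=1)

# Vectorized alternative: add the two columns directly
df["tot_price_vectorized"] = df["price_1"] + df["price_2"]

print(df)
```

&lt;p&gt;Row-wise &lt;code class=&quot;language-text&quot;&gt;apply&lt;/code&gt; is most useful when the logic cannot easily be written as column operations, for example when it involves branching on several columns.&lt;/p&gt;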
&lt;h1&gt;Select specific column types with select_dtypes()&lt;/h1&gt;
&lt;p&gt;A very common situation is having a large DataFrame with columns of different data types and needing to filter or operate on only the columns of one type. Pandas provides &lt;code class=&quot;language-text&quot;&gt;select_dtypes()&lt;/code&gt; for exactly that. Let’s see an example:&lt;/p&gt;
&lt;p&gt;&lt;div id=&quot;gist121910026&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot; translate=&quot;no&quot; data-color-mode=&quot;light&quot; data-light-theme=&quot;light&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        
&lt;div class=&quot;js-gist-file-update-container js-task-list-container&quot;&gt;
      &lt;div id=&quot;file-selectdtypes-py&quot; class=&quot;file my-2&quot;&gt;
    
    &lt;div itemprop=&quot;text&quot;
      class=&quot;Box-body p-0 blob-wrapper data type-python  &quot;
      style=&quot;overflow: auto&quot; tabindex=&quot;0&quot; role=&quot;region&quot;
      aria-label=&quot;selectdtypes.py content, created by tariqmassaoudi on 04:57PM on April 13, 2023.&quot;
    &gt;

        
&lt;div class=&quot;js-check-hidden-unicode js-blob-code-container blob-code-content&quot;&gt;


  &lt;table data-hpc class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;4&quot; data-paste-markdown-skip data-tagsearch-path=&quot;selectdtypes.py&quot;&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-selectdtypes-py-L1&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-selectdtypes-py-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;import pandas as pd&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-selectdtypes-py-L2&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-selectdtypes-py-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-selectdtypes-py-L3&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;3&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-selectdtypes-py-LC3&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;# Load the Titanic dataset&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-selectdtypes-py-L4&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;4&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-selectdtypes-py-LC4&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;titanic_df = pd.read_csv(&amp;#39;titanic.csv&amp;#39;)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-selectdtypes-py-L5&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;5&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-selectdtypes-py-LC5&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-selectdtypes-py-L6&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;6&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-selectdtypes-py-LC6&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;# Use select_dtypes() to select only numerical columns&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-selectdtypes-py-L7&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;7&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-selectdtypes-py-LC7&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;numerical_cols = titanic_df.select_dtypes(include=&amp;#39;number&amp;#39;)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-selectdtypes-py-L8&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;8&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-selectdtypes-py-LC8&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-selectdtypes-py-L9&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;9&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-selectdtypes-py-LC9&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;# Print the selected numerical columns&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-selectdtypes-py-L10&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;10&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-selectdtypes-py-LC10&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;numerical_cols.head(5)&lt;/td&gt;
        &lt;/tr&gt;
  &lt;/table&gt;
&lt;/div&gt;


    &lt;/div&gt;

  &lt;/div&gt;

&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/tariqmassaoudi/4f20a7a7a56bd3942ac1f26cc434d70a/raw/f68a89fa79de4e0aa08c7c24d021869a92ca2107/selectdtypes.py&quot; style=&quot;float:right&quot; class=&quot;Link--inTextBlock&quot;&gt;view raw&lt;/a&gt;
        &lt;a href=&quot;https://gist.github.com/tariqmassaoudi/4f20a7a7a56bd3942ac1f26cc434d70a#file-selectdtypes-py&quot; class=&quot;Link--inTextBlock&quot;&gt;
          selectdtypes.py
        &lt;/a&gt;
        hosted with &amp;#10084; by &lt;a class=&quot;Link--inTextBlock&quot; href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/v2/resize:fit:563/1*z8v0V7n-79DqRIMPxj6aBA.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;In this example, we are selecting only the numerical columns in the Titanic dataset.&lt;/p&gt;
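&lt;p&gt;The same method also accepts an &lt;code class=&quot;language-text&quot;&gt;exclude&lt;/code&gt; argument. The sketch below uses a tiny hypothetical DataFrame (the column names mimic the Titanic data, but the values are made up) to split columns into numeric and non-numeric groups. Note that boolean columns are not treated as numeric by the &lt;code class=&quot;language-text&quot;&gt;&quot;number&quot;&lt;/code&gt; selector.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical stand-in for a few Titanic columns
df = pd.DataFrame({
    "Name": ["Ali", "Mary"],
    "Age": [22.0, 38.0],
    "Pclass": [3, 1],
    "Survived": [False, True],
})

numeric = df.select_dtypes(include="number")      # integer and float columns
non_numeric = df.select_dtypes(exclude="number")  # everything else

print(list(numeric.columns))
print(list(non_numeric.columns))
```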
&lt;h1&gt;Use Pandas query() instead of a boolean mask to filter your DataFrame&lt;/h1&gt;
&lt;p&gt;Using &lt;code class=&quot;language-text&quot;&gt;query()&lt;/code&gt; can make your code shorter and cleaner. Here’s a comparison of the two syntaxes:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;# Filter using a boolean mask
titanic_df = titanic_df[(titanic_df[&quot;Sex&quot;] == &quot;female&quot;) &amp;amp; (titanic_df[&quot;Age&quot;] &gt; 18)]

# Filter using query()
titanic_df = titanic_df.query(&apos;Sex == &quot;female&quot; and Age &gt; 18&apos;)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Instead of repeating &lt;code class=&quot;language-text&quot;&gt;titanic_df&lt;/code&gt; for every condition in the mask, with &lt;code class=&quot;language-text&quot;&gt;query()&lt;/code&gt; I only had to name the columns. It produces the same result while being cleaner and more readable!&lt;/p&gt;
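&lt;p&gt;A query string can also refer to local Python variables by prefixing them with &lt;code class=&quot;language-text&quot;&gt;@&lt;/code&gt;, which avoids building the string by hand. Here is a small sketch on made-up rows; the variable name &lt;code class=&quot;language-text&quot;&gt;min_age&lt;/code&gt; and the data are assumptions for illustration.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical rows standing in for the Titanic data
df = pd.DataFrame({
    "Sex": ["female", "male", "female", "female"],
    "Age": [25, 40, 12, 31],
})

min_age = 18  # local Python variable, referenced in the query with @

adult_women = df.query('Sex == "female" and Age > @min_age')
print(adult_women)
```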
&lt;h1&gt;Use list comprehension to create lists in one line&lt;/h1&gt;
&lt;p&gt;List comprehension is a powerful Python technique for creating a list in a single line of code. It builds a new list by applying an expression to each element of an iterable, such as a list, tuple, or string, optionally keeping only the elements that satisfy a condition. For simple transformations, it is shorter and more readable than a traditional loop.&lt;/p&gt;
&lt;p&gt;Here’s the basic syntax:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;[expression for item in iterable if condition]&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here’s an example of using list comprehension to create a list of even numbers from a given list:
&lt;div id=&quot;gist121910242&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot; translate=&quot;no&quot; data-color-mode=&quot;light&quot; data-light-theme=&quot;light&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        
&lt;div class=&quot;js-gist-file-update-container js-task-list-container&quot;&gt;
      &lt;div id=&quot;file-even_numbers-py&quot; class=&quot;file my-2&quot;&gt;
    
    &lt;div itemprop=&quot;text&quot;
      class=&quot;Box-body p-0 blob-wrapper data type-python  &quot;
      style=&quot;overflow: auto&quot; tabindex=&quot;0&quot; role=&quot;region&quot;
      aria-label=&quot;even_numbers.py content, created by tariqmassaoudi on 05:05PM on April 13, 2023.&quot;
    &gt;

        
&lt;div class=&quot;js-check-hidden-unicode js-blob-code-container blob-code-content&quot;&gt;


  &lt;table data-hpc class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;4&quot; data-paste-markdown-skip data-tagsearch-path=&quot;even_numbers.py&quot;&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-even_numbers-py-L1&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-even_numbers-py-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-even_numbers-py-L2&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-even_numbers-py-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;even_numbers = [x for x in numbers if x % 2 == 0]&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-even_numbers-py-L3&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;3&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-even_numbers-py-LC3&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;print(even_numbers)&lt;/td&gt;
        &lt;/tr&gt;
  &lt;/table&gt;
&lt;/div&gt;


    &lt;/div&gt;

  &lt;/div&gt;

&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/tariqmassaoudi/59479675886a6e72725f5e2d15662b81/raw/ec9fb76bcc888ed297331e281b05c3a5f6e28b85/even_numbers.py&quot; style=&quot;float:right&quot; class=&quot;Link--inTextBlock&quot;&gt;view raw&lt;/a&gt;
        &lt;a href=&quot;https://gist.github.com/tariqmassaoudi/59479675886a6e72725f5e2d15662b81#file-even_numbers-py&quot; class=&quot;Link--inTextBlock&quot;&gt;
          even_numbers.py
        &lt;/a&gt;
        hosted with &amp;#10084; by &lt;a class=&quot;Link--inTextBlock&quot; href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;[2, 4, 6, 8, 10]&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Keep in mind that Python also supports dictionary comprehensions, set comprehensions, and generator expressions.&lt;/p&gt;
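&lt;p&gt;To make those variants concrete, here is a short sketch that reuses the same list of numbers. The variable names are illustrative only.&lt;/p&gt;

```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Dictionary comprehension: map each even number to its square
even_squares = {x: x * x for x in numbers if x % 2 == 0}

# Set comprehension: collect the distinct remainders modulo 3
remainders = {x % 3 for x in numbers}

# Generator expression: values are produced lazily, one at a time
total_of_odds = sum(x for x in numbers if x % 2 == 1)

print(even_squares)
print(remainders)
print(total_of_odds)
```

&lt;p&gt;The generator expression never builds an intermediate list, which can matter when the input is large.&lt;/p&gt;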
&lt;h1&gt;Enhance Your Loops with enumerate() and zip() in Python&lt;/h1&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;enumerate()&lt;/code&gt; is used to loop over an iterable while keeping track of the index or position of each item. It saves you from maintaining a separate counter variable, like &lt;code class=&quot;language-text&quot;&gt;i&lt;/code&gt;, by hand. The basic syntax for using &lt;code class=&quot;language-text&quot;&gt;enumerate()&lt;/code&gt; in a loop is as follows:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;for index, item in enumerate(iterable):  
    # Do something with index and item&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here’s an example:
&lt;div id=&quot;gist121910340&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot; translate=&quot;no&quot; data-color-mode=&quot;light&quot; data-light-theme=&quot;light&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        
&lt;div class=&quot;js-gist-file-update-container js-task-list-container&quot;&gt;
      &lt;div id=&quot;file-enumerate-py&quot; class=&quot;file my-2&quot;&gt;
    
    &lt;div itemprop=&quot;text&quot;
      class=&quot;Box-body p-0 blob-wrapper data type-python  &quot;
      style=&quot;overflow: auto&quot; tabindex=&quot;0&quot; role=&quot;region&quot;
      aria-label=&quot;enumerate.py content, created by tariqmassaoudi on 05:13PM on April 13, 2023.&quot;
    &gt;

        
&lt;div class=&quot;js-check-hidden-unicode js-blob-code-container blob-code-content&quot;&gt;


  &lt;table data-hpc class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;4&quot; data-paste-markdown-skip data-tagsearch-path=&quot;enumerate.py&quot;&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-enumerate-py-L1&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-enumerate-py-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;names = [&amp;quot;Ali&amp;quot;, &amp;quot;Ahmed&amp;quot;, &amp;quot;Bob&amp;quot;, &amp;quot;Mary&amp;quot;]&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-enumerate-py-L2&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-enumerate-py-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;for index, name in enumerate(names):&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-enumerate-py-L3&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;3&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-enumerate-py-LC3&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    print(f&amp;quot;Index: {index}, Name: {name}&amp;quot;)&lt;/td&gt;
        &lt;/tr&gt;
  &lt;/table&gt;
&lt;/div&gt;


    &lt;/div&gt;

  &lt;/div&gt;

&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/tariqmassaoudi/51eb2e161817c94a1b98ef8b3ca40799/raw/163a5b4efc1558c75d6a849c2c549ebcc94fd692/enumerate.py&quot; style=&quot;float:right&quot; class=&quot;Link--inTextBlock&quot;&gt;view raw&lt;/a&gt;
        &lt;a href=&quot;https://gist.github.com/tariqmassaoudi/51eb2e161817c94a1b98ef8b3ca40799#file-enumerate-py&quot; class=&quot;Link--inTextBlock&quot;&gt;
          enumerate.py
        &lt;/a&gt;
        hosted with &amp;#10084; by &lt;a class=&quot;Link--inTextBlock&quot; href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;Index: 0, Name: Ali  
Index: 1, Name: Ahmed  
Index: 2, Name: Bob  
Index: 3, Name: Mary&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
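&lt;p&gt;By default the counter starts at 0, as in the output above. If a human-friendly numbering is wanted, &lt;code class=&quot;language-text&quot;&gt;enumerate()&lt;/code&gt; accepts a &lt;code class=&quot;language-text&quot;&gt;start&lt;/code&gt; argument. A minimal sketch (the &lt;code class=&quot;language-text&quot;&gt;ranked&lt;/code&gt; name is just for illustration):&lt;/p&gt;

```python
names = ["Ali", "Ahmed", "Bob", "Mary"]

# start=1 makes the counter begin at 1 instead of the default 0
ranked = [f"{position}. {name}" for position, name in enumerate(names, start=1)]

for line in ranked:
    print(line)
```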
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;zip()&lt;/code&gt; combines two or more sequences into a single iterable that yields one tuple per position, so the sequences can be looped over in parallel. It saves you from indexing into each sequence by hand, making your code cleaner. Note that it stops as soon as the shortest sequence is exhausted. The basic syntax for using &lt;code class=&quot;language-text&quot;&gt;zip()&lt;/code&gt; in a loop is as follows:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;for item1, item2 in zip(sequence1, sequence2):  
    # Do something with item1 and item2&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here’s an example:
&lt;div id=&quot;gist121910381&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot; translate=&quot;no&quot; data-color-mode=&quot;light&quot; data-light-theme=&quot;light&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        
&lt;div class=&quot;js-gist-file-update-container js-task-list-container&quot;&gt;
      &lt;div id=&quot;file-zip-py&quot; class=&quot;file my-2&quot;&gt;
    
    &lt;div itemprop=&quot;text&quot;
      class=&quot;Box-body p-0 blob-wrapper data type-python  &quot;
      style=&quot;overflow: auto&quot; tabindex=&quot;0&quot; role=&quot;region&quot;
      aria-label=&quot;zip.py content, created by tariqmassaoudi on 05:16PM on April 13, 2023.&quot;
    &gt;

        
&lt;div class=&quot;js-check-hidden-unicode js-blob-code-container blob-code-content&quot;&gt;


  &lt;table data-hpc class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;4&quot; data-paste-markdown-skip data-tagsearch-path=&quot;zip.py&quot;&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-zip-py-L1&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-zip-py-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;names = [&amp;quot;Ali&amp;quot;, &amp;quot;Ahmed&amp;quot;, &amp;quot;Bob&amp;quot;, &amp;quot;Mary&amp;quot;]&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-zip-py-L2&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-zip-py-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;ages = [25, 30, 35, 40]&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-zip-py-L3&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;3&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-zip-py-LC3&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;for name, age in zip(names, ages):&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-zip-py-L4&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;4&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-zip-py-LC4&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    print(f&amp;quot;Name: {name}, Age: {age} years&amp;quot;)&lt;/td&gt;
        &lt;/tr&gt;
  &lt;/table&gt;
&lt;/div&gt;


    &lt;/div&gt;

  &lt;/div&gt;

&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/tariqmassaoudi/318a78fd08c45dfc489004b23cf1c8cb/raw/cc55ee9f5662410fcae262d6a0536ffe6b9e4603/zip.py&quot; style=&quot;float:right&quot; class=&quot;Link--inTextBlock&quot;&gt;view raw&lt;/a&gt;
        &lt;a href=&quot;https://gist.github.com/tariqmassaoudi/318a78fd08c45dfc489004b23cf1c8cb#file-zip-py&quot; class=&quot;Link--inTextBlock&quot;&gt;
          zip.py
        &lt;/a&gt;
        hosted with &amp;#10084; by &lt;a class=&quot;Link--inTextBlock&quot; href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;Name: Ali, Age: 25 years  
Name: Ahmed, Age: 30 years  
Name: Bob, Age: 35 years  
Name: Mary, Age: 40 years&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h1&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;/h1&gt;
&lt;p&gt;To sum up, by implementing these top 5 Python tips in your data science projects, you can make your code cleaner and more readable.&lt;/p&gt;
&lt;p&gt;I hope that these tips will help you level up as a data scientist!&lt;/p&gt;
&lt;p&gt;If you managed to get here, congratulations. Thanks for reading, I hope you’ve enjoyed the article. Feel free to reach out to me on &lt;a href=&quot;https://www.linkedin.com/in/tariqmassaoudi/&quot;&gt;LinkedIn&lt;/a&gt; for further discussion or personal contact.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[How I Passed The AWS Solution Architect Associate (SAA-C03)]]></title><description><![CDATA[How I Passed The AWS Solution Architect Associate (SAA-C03)  I passed the AWS Solutions Architect Associate (SSA-003) exam in December 202…]]></description><link>https://www.tariqmassaoudi.com/how-i-passed-aws-solution-architect/</link><guid isPermaLink="false">https://www.tariqmassaoudi.com/how-i-passed-aws-solution-architect/</guid><pubDate>Fri, 02 Dec 2022 22:12:03 GMT</pubDate><content:encoded>&lt;h3&gt;How I Passed The AWS Solution Architect Associate (SAA-C03)&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/1000/0*7wvhCR-8_88YHVV9.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;I passed the AWS Solutions Architect Associate (SAA-C03) exam in December 2022. In this article I’ll share the resources I used, some tips for the exam and some notes I took during preparation. With some prior experience using AWS (S3, Lambda, EC2, RDS) and some general IT knowledge, it took me around &lt;strong&gt;one month&lt;/strong&gt; of light preparation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What will you learn?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You’ll learn how to design good systems on AWS: given requirements on resiliency, performance, cost, security and availability, how do you combine different AWS services to design the best system possible? This means you’ll need deep knowledge of the &lt;strong&gt;services on AWS&lt;/strong&gt; and of &lt;strong&gt;best practices&lt;/strong&gt; for designing systems.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How did I prepare?&lt;/strong&gt;&lt;/p&gt;
&lt;ol class=&quot;list-disc&quot;&gt;
&lt;li&gt;Took a &lt;a href=&quot;https://www.udemy.com/course/aws-certified-solutions-architect-associate-saa-c03/&quot;&gt;course on Udemy&lt;/a&gt;, “Ultimate AWS Certified Solutions Architect Associate SAA-C03” by Stephane Maarek. It is comprehensive and gives you a high level understanding with an emphasis on the “why”. You need to complement it by playing around in the AWS console and doing hands-on work yourself.&lt;/li&gt;
&lt;li&gt;While taking the course, I referred to the official documentation for each service to get the fine details.&lt;/li&gt;
&lt;li&gt;Did 6 &lt;a href=&quot;https://www.udemy.com/course/aws-certified-solutions-architect-associate-amazon-practice-exams-saa-c03/&quot;&gt;mock exams&lt;/a&gt; by Jon Bonso. They provide a very good explanation for each question and are really close to the real exam. You should aim for a score above 80% on these mock exams before taking the real one.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Some Tips:&lt;/strong&gt;&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Most of the exam is about the core services (S3, EC2, SQS, VPC, etc.). It is helpful to study newer services, but only a high level understanding of them is required.&lt;/li&gt;
&lt;li&gt;To check your knowledge of a certain service you can ask yourself: What does this service do? When should I use this service? How does it integrate with other services? What about security and high availability?&lt;/li&gt;
&lt;li&gt;In some questions you’ll find multiple answers that technically work. Read the question again: it will often mention something like “most cost effective” or “with least operational overhead”. Use this to guide your final choice.&lt;/li&gt;
&lt;li&gt;Take it slowly: the exam is about 2 hours, and I finished all the questions with 40 minutes to spare.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Below you’ll find my &lt;strong&gt;non-comprehensive list of notes&lt;/strong&gt; that I took during preparation.&lt;/p&gt;
&lt;h3&gt;EC2:&lt;/h3&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;A service to rent VMs.&lt;/li&gt;
&lt;li&gt;Compute optimized instances start with C, memory optimized start with R, storage optimized start with I or D.&lt;/li&gt;
&lt;li&gt;Give your instance permissions with &lt;strong&gt;IAM Roles.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;On demand&lt;/strong&gt; instances are the most expensive and are good for temporary workloads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reserved instances&lt;/strong&gt; for 1 or 3 years are good for consistent demand. &lt;strong&gt;Convertible reserved&lt;/strong&gt; instances can be exchanged for a different instance type or family, but come with a smaller discount.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spot instances&lt;/strong&gt; are cheap and good only for workloads that can be interrupted. With &lt;strong&gt;dedicated hosts&lt;/strong&gt; you get direct access to the hardware; with dedicated instances you make sure no other customer is using the same hardware as you. To terminate &lt;strong&gt;persistent spot instances&lt;/strong&gt;, cancel the request first, then terminate the instances.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ENIs&lt;/strong&gt; are network cards. They have public / private IPs, they can be attached to or detached from EC2 instances, and they’re useful for failovers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use cluster placement&lt;/strong&gt; for HPC (High Performance Computing) applications (single AZ), use &lt;strong&gt;spread placement&lt;/strong&gt; for critical applications (max 7 instances per AZ), use &lt;strong&gt;Partition&lt;/strong&gt; for the best of both worlds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Elastic IPs&lt;/strong&gt; let you keep the same public IP when you stop then start an instance. You pay for Elastic IPs you are not using.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hibernation&lt;/strong&gt; lets you save the RAM state. You are billed while the instance is preparing to hibernate; you are not billed while the instance is preparing to stop.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Limits&lt;/strong&gt;: 20 running instances per region, also a vCPU limit; you need to get validation from AWS to increase limits.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Storage:&lt;/h3&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;&lt;strong&gt;Object storage&lt;/strong&gt;: S3&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File storage&lt;/strong&gt;: EFS (EFS has two tiers, Standard and Infrequent Access), FSx for Lustre, FSx for Windows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Volume Storage&lt;/strong&gt;: EBS. If you need more than 16K IOPS, use provisioned IOPS EBS. To encrypt an unencrypted EBS volume, create a snapshot, copy the snapshot with encryption enabled, and create a new volume from the copy.&lt;/li&gt;
&lt;li&gt;You can use EBS snapshots to move data between AZs.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;AWS DataSync&lt;/strong&gt; to migrate on-premises storage to S3, EFS or FSx. &lt;strong&gt;Storage Gateway&lt;/strong&gt; connects your on-premises storage to AWS: File Gateway uses NFS and SMB, Volume Gateway syncs data to S3, and Tape Gateway offers compatibility with tape-based backups and stores the data in S3.&lt;/li&gt;
&lt;li&gt;On S3 you can enable &lt;strong&gt;versioning&lt;/strong&gt; and &lt;strong&gt;MFA delete&lt;/strong&gt; to prevent accidental deletions. You can enable default encryption in the bucket settings, add a header to your PUT request to encrypt a specific object, and add a bucket policy that rejects uploads of unencrypted objects.&lt;/li&gt;
&lt;li&gt;To migrate data when you don’t have good network bandwidth, use a &lt;strong&gt;Snow device&lt;/strong&gt; (Snowcone, which is physically small, Snowball Edge, or Snowmobile). Snowball Edge holds max 80 TB, Snowcone 8 TB. Snowball can’t import directly into Glacier; import into S3 and then use a lifecycle policy.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Load Balancers:&lt;/h3&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Want to route same client to same machine? Enable “&lt;strong&gt;sticky sessions&lt;/strong&gt;” option.&lt;/li&gt;
&lt;li&gt;All load balancers support &lt;strong&gt;health checks&lt;/strong&gt; and use &lt;strong&gt;target groups&lt;/strong&gt;; you can set up multiple target groups.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cross-zone load balancing&lt;/strong&gt; spreads traffic evenly among instances in multiple AZs, useful when your AZs have different numbers of instances. Enabled by default for ALB; you need to enable it and pay for inter-AZ traffic for NLB; you need to enable it for CLB, but it is free.&lt;/li&gt;
&lt;li&gt;ALB/NLB can use multiple &lt;strong&gt;SSL certificates&lt;/strong&gt; using SNI, CLB can’t.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Connection Draining&lt;/strong&gt; (CLB) / &lt;strong&gt;Deregistration Delay&lt;/strong&gt; (ALB/NLB): when an instance is deregistering, you want to give clients some time to complete their in-flight requests before it is removed from the target group. You can configure this delay; set it based on whether your requests are short or long. It can be between 1 and 3600 seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Classic Load Balancer&lt;/strong&gt; (CLB): The old one. Supports TCP &amp;#x26; HTTP/S, supports health checks, fixed hostname. No reason to use it over the modern ones.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Application Load Balancer&lt;/strong&gt; (ALB): Supports HTTP/S (layer 7) and routing rules based on path, URL parameters, etc. Good for microservices (Docker / ECS). The target doesn’t see the client’s connection directly; if it needs the original IP or port, it has to read them from the X-Forwarded-For and X-Forwarded-Port headers added by the ALB. Can use a Lambda function as a target.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network Load Balancer&lt;/strong&gt; (NLB): Supports TCP &amp;#x26; UDP (layer 4), very high performance, lower latency than ALB, and you get one static IP per AZ. Forwards the original request from the client. You can use it with EC2 instances &amp;#x26; IP addresses.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gateway load balancer&lt;/strong&gt;: Make all traffic go through 3rd party security systems like firewalls, intrusion detection, etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Auto Scaling Groups (ASG):&lt;/h3&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Works with EC2 instances and integrates with load balancers. You can set a min and max size; desired capacity is the number of instances launched initially. Does health checks by default.&lt;/li&gt;
&lt;li&gt;To create one you need a &lt;strong&gt;launch template&lt;/strong&gt;, which defines what kind of EC2 instances you want in your ASG. Scaling policies are driven by CloudWatch alarms on a chosen metric.&lt;/li&gt;
&lt;li&gt;You can’t modify a launch template version once created; you create a new version and point the ASG at it. ASGs are free, you only pay for the underlying resources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Scaling policies:&lt;/strong&gt;&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Target Tracking Scaling: keep a certain metric at a target value, for example average CPU usage at 50%.&lt;/li&gt;
&lt;li&gt;Simple Scaling: if CPU exceeds 90%, for example, add 3 instances.&lt;/li&gt;
&lt;li&gt;Scheduled Scaling: every Sunday from 8 AM to 5 PM, add 6 instances.&lt;/li&gt;
&lt;li&gt;Predictive Scaling: scale based on historical data using forecasting.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cooldown Period&lt;/strong&gt;: after a scaling activity, the ASG waits (300 secs by default) before another one can start. This gives the scaling metric time to stabilize.&lt;/li&gt;
&lt;li&gt;You can define &lt;strong&gt;lifecycle hooks&lt;/strong&gt; to do extra work before an instance launches or terminates. For example, before an extra instance goes into service you may want it to run a script that downloads software.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;ASG Termination Policy:&lt;/strong&gt;&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Find the AZ with the most instances.&lt;/li&gt;
&lt;li&gt;In that AZ, terminate the instance with the oldest launch template.&lt;/li&gt;
&lt;li&gt;If several instances still tie, terminate the one closest to the next billing hour.&lt;/li&gt;
&lt;/ul&gt;
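&lt;p&gt;The three steps above can be read as a filter followed by a tie-breaking sort. Here is a minimal Python sketch of that reading. The &lt;code&gt;Instance&lt;/code&gt; class, its fields and the sample fleet are made up for illustration; this is not how AWS implements the policy, only a way to remember the order of the rules.&lt;/p&gt;

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    # Hypothetical fields, just enough to express the three rules.
    instance_id: str
    az: str
    template_version: int         # lower number = older launch template
    minutes_to_billing_hour: int


def pick_instance_to_terminate(instances: list[Instance]) -> Instance:
    """Apply the three steps in order; later steps only break ties."""
    # Step 1: keep only instances in the AZ that has the most instances.
    per_az = Counter(i.az for i in instances)
    busiest_az = max(per_az, key=per_az.get)
    candidates = [i for i in instances if i.az == busiest_az]
    # Steps 2 and 3: oldest launch template first, then the instance
    # closest to its next billing hour.
    return min(
        candidates,
        key=lambda i: (i.template_version, i.minutes_to_billing_hour),
    )


fleet = [
    Instance("i-a1", "eu-west-1a", template_version=2, minutes_to_billing_hour=10),
    Instance("i-a2", "eu-west-1a", template_version=1, minutes_to_billing_hour=40),
    Instance("i-a3", "eu-west-1a", template_version=1, minutes_to_billing_hour=5),
    Instance("i-b1", "eu-west-1b", template_version=1, minutes_to_billing_hour=1),
]
print(pick_instance_to_terminate(fleet).instance_id)  # i-a3
```

&lt;p&gt;In this made-up fleet, i-b1 is never considered because its AZ has fewer instances, and i-a3 wins over i-a2 only because it is closer to its billing hour.&lt;/p&gt;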
&lt;h3&gt;Relational Database Service (RDS):&lt;/h3&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Postgres, MySQL, MariaDB, Oracle, SQL Server, Aurora (AWS only)&lt;/li&gt;
&lt;li&gt;RDS Storage can autoscale&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Read Replica&lt;/strong&gt;: you can create up to 5, in the same AZ, cross AZ or cross region, with async replication between the main DB and the read replicas. A read replica can be promoted to its own standalone DB. Cross region read replicas incur network costs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi AZ&lt;/strong&gt; for Disaster Recovery: Standby DB in another AZ with automatic failover using same DNS name, uses synchronous replication.&lt;/li&gt;
&lt;li&gt;Possible to use Multi AZ with your Read Replicas.&lt;/li&gt;
&lt;li&gt;You can activate Multi AZ after deployment just by changing config.&lt;/li&gt;
&lt;li&gt;Continuous Backup and Restore with retention up to 35 days.&lt;/li&gt;
&lt;li&gt;Scale by read replica or bigger instance.&lt;/li&gt;
&lt;li&gt;Supports encryption at rest using &lt;strong&gt;KMS&lt;/strong&gt; and in-flight encryption with SSL, you can enforce SSL, method depends on your db engine.&lt;/li&gt;
&lt;li&gt;To encrypt RDS after deployment, create a snapshot, copy the snapshot with encryption enabled, and restore your DB from the encrypted copy.&lt;/li&gt;
&lt;li&gt;You can use &lt;strong&gt;IAM Auth&lt;/strong&gt; for MySQL and Postgres.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Aurora:&lt;/h3&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Aurora is compatible only with MySQL and Postgres. With Aurora you get up to &lt;strong&gt;15 read replicas&lt;/strong&gt; and automatic storage scaling. High availability by default. Supports cross region replication.&lt;/li&gt;
&lt;li&gt;Read replicas can auto scale, you only deal with reader endpoint/ writer endpoint. There’s one writer and multiple readers.&lt;/li&gt;
&lt;li&gt;You can enable &lt;strong&gt;Multi-Master&lt;/strong&gt; if you need high availability for writer node, makes all instances capable of both write and read.&lt;/li&gt;
&lt;li&gt;Comes in &lt;strong&gt;provisioned&lt;/strong&gt; or &lt;strong&gt;serverless&lt;/strong&gt; modes. You can group your read replicas behind custom endpoints, which is useful if you have different types of reader instances and want to group them by workload.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;ElastiCache:&lt;/h3&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Managed Redis or Memcached, in memory databases for low latency and high performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use cases&lt;/strong&gt;: cache common queries to take load off the database, or store user sessions.&lt;/li&gt;
&lt;li&gt;Redis vs Memcached: Redis has Multi AZ, read replicas and backups. Memcached is faster but has no durability features.&lt;/li&gt;
&lt;li&gt;In Redis, to enable &lt;strong&gt;Redis Auth&lt;/strong&gt;, which is used for security, you need to enable encryption in transit.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;DynamoDB:&lt;/h3&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Managed &lt;strong&gt;NoSQL database&lt;/strong&gt; (key/value). Choose &lt;strong&gt;provisioned&lt;/strong&gt; read/write capacity, or &lt;strong&gt;on demand&lt;/strong&gt; mode to pay for whatever reads/writes you actually consume. &lt;strong&gt;Provisioned&lt;/strong&gt; mode supports autoscaling for reads and writes. Multi AZ by default. You can make it &lt;strong&gt;global&lt;/strong&gt; by enabling DynamoDB global tables. Supports &lt;strong&gt;backup and restore&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Enable &lt;strong&gt;DAX&lt;/strong&gt; for an automatic read cache whenever you need accelerated reads.&lt;/li&gt;
&lt;li&gt;Security and auth are integrated with IAM. Use &lt;strong&gt;DynamoDB Streams&lt;/strong&gt; to detect changes and trigger events based on them; you need to enable this for global tables to work.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;SQS:&lt;/h3&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Used to &lt;strong&gt;decouple applications&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Default retention for messages is 4 days, with a max of 14 days. Max 256KB per message. There are producers and consumers: consumers poll the queue for messages, and a consumer can receive up to 10 messages at a time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Queue Access Policy&lt;/strong&gt; can allow another AWS account to access your queue.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Message visibility timeout&lt;/strong&gt; is how long a message stays invisible to other consumers after one consumer has received it, i.e. how much time that consumer has to process it. The default is 30 secs and the max is 12 hours; after the timeout the message becomes visible in the queue again if it was not deleted. You can change the timeout for a message in flight using the &lt;strong&gt;ChangeMessageVisibility&lt;/strong&gt; API call.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delivery Delay&lt;/strong&gt; allows you to set a delay of up to 15 mins; messages will only be visible after the delay.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dead Letter Queue&lt;/strong&gt;: if a message has been returned to the queue many times, there may be an error and you may want to set it aside. After a configured max number of receives, the message is automatically sent to the DLQ. You can then process/debug these messages and return them to the regular queue.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Long Polling vs Short Polling&lt;/strong&gt;: the default is short polling: if the queue is empty, SQS sends an empty response immediately and you have to send a new request. With long polling, SQS waits up to 20 secs for new messages to arrive, which is useful to decrease the number of API calls.&lt;/li&gt;
&lt;li&gt;To implement request/response systems, use the built in &lt;strong&gt;SQS Temporary Queue Client.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;FIFO Queue&lt;/strong&gt;: messages are ordered; limited to 300 messages/s, or 3000 messages/s using batching.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;SNS:&lt;/h3&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;&lt;strong&gt;Pub/Sub&lt;/strong&gt; pattern: a publisher sends a message to a topic and SNS distributes it to all subscribers. You can filter messages so that not all subscribers get all messages. Up to 12M subs per topic; subs can be email, SMS, HTTP endpoints or AWS services (SQS, Lambda, etc.).&lt;/li&gt;
&lt;li&gt;Supports inflight / at rest encryption, you can use &lt;strong&gt;SNS access policies&lt;/strong&gt; for cross account sharing or allowing other services to write to SNS.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SNS fanout pattern&lt;/strong&gt; to send messages efficiently to multiple SQS queues. You can preserve order and ensure deduplication by using SNS FIFO that works only with SQS FIFO.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Kinesis:&lt;/h3&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Analyse streaming data in &lt;strong&gt;real time&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Streams&lt;/strong&gt;: Capture, process and store datastreams.&lt;/li&gt;
&lt;li&gt;Capacity is measured in &lt;strong&gt;shards&lt;/strong&gt;: for writing, 1 MB/s or 1000 messages/s per shard; for reading, 2 MB/s per shard shared by all consumers, or 2 MB/s per shard for each consumer at a higher price (enhanced fan-out). Each record can be up to 1 MB. You can choose provisioned or on demand capacity; the latter scales the number of shards automatically. Data retention is from 1 up to 365 days, and you can replay data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Firehose&lt;/strong&gt;: Load data streams into AWS data stores. Can read from Data Streams and optionally use Lambda to transform the data before writing it in batches to a destination: S3, Redshift, Elasticsearch or a custom destination. Can send min of 32 MB per batch.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Analytics&lt;/strong&gt;: Analyse data streams with SQL. Serverless, integrates well with Firehose and Data Streams. Used for time series analytics, real time dashboards, etc.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Video Streams&lt;/strong&gt;: Capture, process and store video streams.&lt;/li&gt;
&lt;/ul&gt;
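&lt;p&gt;The per-shard write limits in the notes above make shard sizing a small arithmetic exercise: you need enough shards to satisfy both the MB/s limit and the messages/s limit. A rough Python sketch, assuming provisioned capacity and steady traffic (the function name and the sample workloads are made up):&lt;/p&gt;

```python
import math

# Per-shard write limits taken from the notes above.
MAX_MB_PER_SECOND_PER_SHARD = 1.0
MAX_RECORDS_PER_SECOND_PER_SHARD = 1000


def shards_needed(records_per_second: int, avg_record_kb: float) -> int:
    """Smallest shard count that satisfies both write limits."""
    mb_per_second = records_per_second * avg_record_kb / 1024
    by_throughput = math.ceil(mb_per_second / MAX_MB_PER_SECOND_PER_SHARD)
    by_record_count = math.ceil(records_per_second / MAX_RECORDS_PER_SECOND_PER_SHARD)
    return max(1, by_throughput, by_record_count)


# 5000 records/s of 2 KB each is about 9.8 MB/s, so throughput is the bottleneck.
print(shards_needed(5000, 2))    # 10
# 5000 records/s of 0.1 KB each is under 1 MB/s, so the record count dominates.
print(shards_needed(5000, 0.1))  # 5
```

&lt;p&gt;Real traffic is bursty, so in practice you would leave some headroom on top of this number or use on demand capacity.&lt;/p&gt;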
&lt;h3&gt;Elastic Container Service (ECS):&lt;/h3&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Good for &lt;strong&gt;microservices&lt;/strong&gt;. You can store images in Amazon Elastic Container Registry (ECR). You can choose the EC2 launch type or the Fargate launch type (serverless). The Docker containers you run are called “tasks”, and ECS tasks can be invoked by Amazon EventBridge.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;IAM Roles for ECS:&lt;/strong&gt; for the EC2 launch type you get EC2 instance profiles; for both EC2 and Fargate you get task roles, which let you fine-tune access per task. You can integrate both launch types with ALB or NLB. The file system you use with ECS is EFS because it is shared. You can’t use FSx for Lustre or Windows, and you can’t use S3 as a file system.&lt;/li&gt;
&lt;li&gt;To use it, you create a &lt;strong&gt;task definition&lt;/strong&gt;, then launch a service from that task definition. You can autoscale at the task level using AWS Application Auto Scaling, based on CPU, RAM or ALB request count.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rolling updates:&lt;/strong&gt; you can control how many tasks start and stop during an update using the minimum healthy percent / maximum percent settings.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Elastic Kubernetes Service (EKS)&lt;/strong&gt;: managed Kubernetes on AWS, an alternative to ECS with an open source API; supports EC2 or Fargate.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;Route 53&lt;/strong&gt;:&lt;/h3&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Used to &lt;strong&gt;register domains / DNS manager&lt;/strong&gt;. You can use public or private hosted zones; private ones only work inside your VPCs. TTL (Time to Live): the client caches the DNS result for the duration of the TTL; the default is 300 secs. TTL is not mandatory for an &lt;strong&gt;Alias Record&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;A CNAME points a hostname to another hostname, but you can’t use it for the root domain. To use the root domain you need an &lt;strong&gt;Alias with A record&lt;/strong&gt;. You can’t set an Alias record for an EC2 DNS name.&lt;/li&gt;
&lt;li&gt;To enable health checks you must allow traffic from the Route 53 health checkers to your resources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Routing Policies&lt;/strong&gt;:&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;&lt;strong&gt;Simple&lt;/strong&gt;: specify one or multiple IPs; the client is routed to a randomly chosen one. No health checks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multivalue&lt;/strong&gt;: Like simple but with health checks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Weighted&lt;/strong&gt;: Assign different weights to different resources and route based on relative weight. Supports health checks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Failover&lt;/strong&gt;: used for Disaster Recovery. Uses health checks; you can only use 2 records here, one primary and one secondary.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Geolocation&lt;/strong&gt;: You can use it to change behavior based on the user’s country, for example to block content or change the language.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Geoproximity&lt;/strong&gt;: Route based on geographic location. It’s more flexible than geolocation: you can use a bias to expand or shrink the geographic region allocated to a specific resource. With 0 bias, users go to the closest resource.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency&lt;/strong&gt;: Route to the AWS region with the lowest latency for the user. Supports health checks.&lt;/li&gt;
&lt;/ul&gt;
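&lt;p&gt;“Route based on relative weight” in the weighted policy means each record receives a share of traffic equal to its weight divided by the sum of all weights. The small Python simulation below illustrates that idea only; the hostnames and weights are made up, and it does not model health checks or DNS caching.&lt;/p&gt;

```python
import random
from collections import Counter


def pick_endpoint(weights: dict[str, int], rng: random.Random) -> str:
    """Choose one endpoint with probability weight / sum(weights)."""
    endpoints = list(weights)
    return rng.choices(endpoints, weights=[weights[e] for e in endpoints], k=1)[0]


# Hypothetical records: 70 / (70 + 30) = 70% of answers should be "blue".
weights = {"blue.example.com": 70, "green.example.com": 30}
rng = random.Random(42)  # seeded so the simulation is repeatable
hits = Counter(pick_endpoint(weights, rng) for _ in range(10_000))
for endpoint, count in hits.items():
    print(f"{endpoint}: {count / 10_000:.1%}")
```

&lt;p&gt;The printed shares should land close to 70% and 30%, which is why weighted routing is handy for sending a small slice of traffic to a new version of an application.&lt;/p&gt;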
&lt;p&gt;If you managed to get here, congratulations. Thanks for reading, I hope you’ve enjoyed the article. For personal contact or discussion, feel free to reach out to me on &lt;a href=&quot;https://www.linkedin.com/in/tariqmassaoudi/&quot;&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[From Idea to Reality: Building a Price History Tool for Moroccan Ecommerce]]></title><description><![CDATA[Have you ever wanted to track the prices of products on an ecommerce platform but found that no price tracker existed for that specific…]]></description><link>https://www.tariqmassaoudi.com/jumia-price-comparator/</link><guid isPermaLink="false">https://www.tariqmassaoudi.com/jumia-price-comparator/</guid><pubDate>Tue, 11 Oct 2022 22:40:32 GMT</pubDate><content:encoded>&lt;p&gt;Have you ever wanted to track the prices of products on an ecommerce platform but found that no price tracker existed for that specific platform? In this article, I’ll share with you how I built a Price Tracker app for Moroccan ecommerce platforms and hosted it on AWS for free. This simple end-to-end data engineering project includes some UX elements and will teach you about web scraping and how to use some of AWS’s services. You can try the app using this &lt;a href=&quot;https://www.tariqmassaoudi.com/jumiaapp/&quot;&gt;link&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;The Context &amp;#x26; The Plan:&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The main value of a Price Tracker is to provide you with the historical price of a product so that you can make your purchasing decision based on data, among other criteria, and minimize the effect of FOMO/discounts that can be in some cases just a form of marketing.&lt;/p&gt;
&lt;p&gt;Price trackers exist for all major international ecommerce websites, such as Amazon, eBay, and Alibaba, but they don’t exist for ecommerce platforms in Morocco. The goal of this project is to create a simple price tracker and host it for free. To achieve the latter, I chose to make use of the AWS Free Tier, which offers quite generous cloud resources, just enough to bootstrap this kind of project if used efficiently. Learn more about the AWS Free Tier &lt;a href=&quot;https://aws.amazon.com/free/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Technical Architecture:&lt;/h3&gt;
&lt;p&gt;The following picture summarizes the architecture. I chose to spread the work across a variety of AWS components so that resources are used as efficiently as possible:&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 1000px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/7ddb8deab245fe2662c58bebe54d33ae/73dae/architecture.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 36.8%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAHCAYAAAAIy204AAAACXBIWXMAAAsTAAALEwEAmpwYAAABJUlEQVQoz42RQUsDMRCF9/8f/RteBREParVIT1aKFEqhVdu62rW7bTebTSbJJ0mtVEXwwUuGycybRyYzxqC1pmkarLWEEL7ovCcixtZKqmvbNt37+LDHe08Wg5/Y5/Kiotsf0+iW/yAJxuPQ1Z4R1hiOjru8LCuM1ilnVkum1xe8FgXfe/1OEMKfE7fa8L6uEBEaVVPVljwv2cyeCCK/HYZAVhZLHu7vol9K49iIx4tD5RPK+RBlo7Pd0NPeLcPReXIymE6ptU5/qOqaZtghSEv2vHjj5LKHareMKsV4rSibmtmgw/zuDGdUkjMSqNcTghqTV4qbxzFluUIplQQX/SvEaLLwuUkJHiuSKN5jPahWkptIl94CVsA7iZvDOZcY0ToQ5/gAgqoeg68J3kcAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;architecture&quot;
        title=&quot;architecture&quot;
        src=&quot;/static/7ddb8deab245fe2662c58bebe54d33ae/00d43/architecture.png&quot;
        srcset=&quot;/static/7ddb8deab245fe2662c58bebe54d33ae/63868/architecture.png 250w,
/static/7ddb8deab245fe2662c58bebe54d33ae/0b533/architecture.png 500w,
/static/7ddb8deab245fe2662c58bebe54d33ae/00d43/architecture.png 1000w,
/static/7ddb8deab245fe2662c58bebe54d33ae/aa440/architecture.png 1500w,
/static/7ddb8deab245fe2662c58bebe54d33ae/e8950/architecture.png 2000w,
/static/7ddb8deab245fe2662c58bebe54d33ae/73dae/architecture.png 2122w&quot;
        sizes=&quot;(max-width: 1000px) 100vw, 1000px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The architecture is split into two main sections:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Scraping/ETL:&lt;/strong&gt; This section is responsible for periodically getting the data, transforming it, and loading it into a Postgres database. I made use of Airflow for scheduling and coordinating tasks written in Python and S3 for an extra backup of the data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data Delivery / UX :&lt;/strong&gt; In classic web app fashion, we have a front-end UI written in JavaScript, calling a REST API which is, in this case, powered by AWS Lambda, which interacts with our Postgres database in RDS. Making efficient use of resources like this is what made it possible to host the project for free.&lt;/p&gt;
&lt;h3&gt;Data Model:&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 1000px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/be48b76da8b8c07704ecc92e734180e1/d8104/datamodel.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 36.8%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAHCAYAAAAIy204AAAACXBIWXMAAAsTAAALEwEAmpwYAAABF0lEQVQoz21R7W7DIAzM+79h1R+t1DZNgEII35Bwkz0hTdssGQv7fPbBhF/Wjg6zFSwiYpERSmecZwfQsdnMeflJqPWAsYXrq/rGfUzB1HtHqR0pn2itg+4xZki1sfuQONdaQ4gZy6JgNo9SKnJOMMZAiBW7tSilYPKhYZURr8VDqIiUCpzbIYVACB7WWoQQ4L2H0Rpaf1gJNaeUoLWBUgrGbKi1YqJj3y3m+cVgaibA7XaDlAo5k+STwe/3G4/HA855OOc4/9uYUAiB6/XKZDT1OA6WSDWKY6N1XXG/33lbwg1CepLhEzUQ8HK58FQi+c8G4fP55EbanAb/IaQkEWmtefLY6CeIf781/oB5nlkJEY7aiCT5C2vJHqz3bDvdAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;datamodel&quot;
        title=&quot;datamodel&quot;
        src=&quot;/static/be48b76da8b8c07704ecc92e734180e1/00d43/datamodel.png&quot;
        srcset=&quot;/static/be48b76da8b8c07704ecc92e734180e1/63868/datamodel.png 250w,
/static/be48b76da8b8c07704ecc92e734180e1/0b533/datamodel.png 500w,
/static/be48b76da8b8c07704ecc92e734180e1/00d43/datamodel.png 1000w,
/static/be48b76da8b8c07704ecc92e734180e1/d8104/datamodel.png 1365w&quot;
        sizes=&quot;(max-width: 1000px) 100vw, 1000px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The main table, “Prices,” holds historical price data. The “products” table stores details about each tracked product, and tables such as “prod_ranking” and “KPI” hold the data behind analytics and recommendations.&lt;/p&gt;
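&lt;p&gt;To make the relationship between the two core tables concrete, here is a minimal sketch. It is not the project’s real schema: the article names the tables but not their columns, so every column below is an assumption (loosely borrowed from the fields the scraper collects), and an in-memory SQLite database stands in for Postgres so the snippet runs on its own.&lt;/p&gt;

```python
import sqlite3

# Hypothetical columns: the article names the tables but not their fields.
SCHEMA = """
CREATE TABLE products (
    id TEXT PRIMARY KEY,
    name TEXT,
    category TEXT,
    brand TEXT
);
CREATE TABLE prices (
    product_id TEXT REFERENCES products(id),
    price REAL,
    scraped_at TEXT
);
"""

# In-memory SQLite is a stand-in for the Postgres database used in the project.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO products VALUES ('p1', 'Kettle', 'Home', 'Acme')")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [("p1", 50.0, "2022-10-01"), ("p1", 40.0, "2022-10-02")],
)

# Average historical price per product, joined with product details.
rows = conn.execute(
    "SELECT p.name, AVG(pr.price) FROM products p "
    "JOIN prices pr ON pr.product_id = p.id GROUP BY p.id"
).fetchall()
print(rows)
```

&lt;p&gt;Keeping one row per product per scrape in the price table is what makes historical averages a simple aggregate query.&lt;/p&gt;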
&lt;h3&gt;Generating the best deals:&lt;/h3&gt;
&lt;p&gt;To generate the best deals, we calculate the average price of a particular product and compare it to its actual price today, getting the percent difference. For example, in the picture below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/1000/1*Y3ybuYRFObrUQ_jaUvNGeQ.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;The product is down 27.62% from its average price. To further enhance recommendations, we prioritize popular products with the highest number of reviews by category.&lt;/p&gt;
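&lt;p&gt;The calculation described above can be sketched in a few lines of Python. This is an illustration rather than the project’s actual code: the function names, dictionary keys and sample prices are hypothetical, and the ranking rule (largest discount first, ties broken by review count) is only one possible reading of how popular products are prioritized.&lt;/p&gt;

```python
from statistics import mean


def percent_below_average(price_history, current_price):
    """Return how far current_price sits below the historical mean, in percent.

    A positive result means the product is cheaper than usual.
    """
    average = mean(price_history)
    return (average - current_price) / average * 100


def rank_deals(products):
    """Sort products by discount (largest first), then by review count."""
    return sorted(
        products,
        key=lambda p: (
            percent_below_average(p["history"], p["price"]),
            p["reviews"],
        ),
        reverse=True,
    )


# Hypothetical sample data: past prices, today's price, review count.
sample = [
    {"name": "phone", "history": [100.0, 120.0, 80.0], "price": 75.0, "reviews": 40},
    {"name": "kettle", "history": [50.0, 50.0], "price": 45.0, "reviews": 300},
]
best = rank_deals(sample)
print(best[0]["name"])
```

&lt;p&gt;In this sample the phone averages 100 and costs 75 today, so it is 25% below its average and ranks ahead of the kettle, which is 10% below.&lt;/p&gt;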
&lt;h3&gt;Deep Dive Into Scraping:&lt;/h3&gt;
&lt;p&gt;The first step is to get URLs of the categories, as shown below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./pictures/scraping1&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Now each category has multiple pages, and we use the page number as a variable to navigate and grab products on each page.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./pictures/scraping0&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
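&lt;p&gt;As a small sketch of the paging idea, the helper below builds one URL per results page. The function name and the example category URL are hypothetical, and it assumes pages are numbered from 1 and selected with a &lt;code&gt;page&lt;/code&gt; query parameter.&lt;/p&gt;

```python
def build_page_urls(category_url, last_page):
    """Return one URL per results page, numbered from 1 to last_page."""
    return [f"{category_url}?page={page}" for page in range(1, last_page + 1)]


# Hypothetical category URL used only for illustration.
urls = build_page_urls("https://example.com/phones/", 3)
for url in urls:
    print(url)
```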
&lt;p&gt;Below is the full scraping code. It uses Python’s requests library to fetch pages, BeautifulSoup to parse the HTML, and tqdm’s thread_map to spread the page requests across multiple threads, which speeds up the task. To learn more about scraping, I’d recommend my &lt;a href=&quot;https://medium.com/analytics-vidhya/every-data-scientist-needs-to-learn-this-4632e3a2e275&quot;&gt;article&lt;/a&gt; or similar content.&lt;/p&gt;
&lt;p&gt;&lt;div id=&quot;gist118835501&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot; translate=&quot;no&quot; data-color-mode=&quot;light&quot; data-light-theme=&quot;light&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        
&lt;div class=&quot;js-gist-file-update-container js-task-list-container&quot;&gt;
      &lt;div id=&quot;file-scrapejumia-py&quot; class=&quot;file my-2&quot;&gt;
    
    &lt;div itemprop=&quot;text&quot;
      class=&quot;Box-body p-0 blob-wrapper data type-python  &quot;
      style=&quot;overflow: auto&quot; tabindex=&quot;0&quot; role=&quot;region&quot;
      aria-label=&quot;scrapeJumia.py content, created by tariqmassaoudi on 01:49PM on October 13, 2022.&quot;
    &gt;

        
&lt;div class=&quot;js-check-hidden-unicode js-blob-code-container blob-code-content&quot;&gt;


  &lt;table data-hpc class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;4&quot; data-paste-markdown-skip data-tagsearch-path=&quot;scrapeJumia.py&quot;&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L1&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;# coding=utf-8&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L2&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;from bs4 import BeautifulSoup&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L3&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;3&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC3&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;import requests&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L4&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;4&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC4&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;from tqdm import tqdm&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L5&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;5&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC5&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;from datetime import datetime&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L6&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;6&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC6&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;import pandas as pd&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L7&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;7&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC7&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;from tqdm.contrib.concurrent import thread_map&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L8&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;8&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC8&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L9&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;9&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC9&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;outfile=&amp;#39;/opt/airflow/dags/jumia_data&amp;#39;+str(datetime.today().strftime(&amp;#39;%Y-%m-%d&amp;#39;))+&amp;#39;.csv&amp;#39;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L10&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;10&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC10&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L11&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;11&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC11&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L12&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;12&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC12&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;def process_article(article):&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L13&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;13&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC13&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    dataid=article.find(&amp;#39;a&amp;#39;).get(&amp;#39;data-id&amp;#39;)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L14&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;14&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC14&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    href=article.find(&amp;#39;a&amp;#39;).get(&amp;#39;href&amp;#39;)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L15&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;15&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC15&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    category=article.find(&amp;#39;a&amp;#39;).get(&amp;#39;data-category&amp;#39;)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L16&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;16&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC16&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    name=article.find(&amp;#39;a&amp;#39;).get(&amp;#39;data-name&amp;#39;)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L17&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;17&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC17&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    price=article.find(class_=&amp;#39;prc&amp;#39;).text&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L18&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;18&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC18&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    stars=article.find(class_=&amp;#39;stars _s&amp;#39;).text if article.find(class_=&amp;#39;stars _s&amp;#39;) else None&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L19&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;19&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC19&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    reviewcount=article.find(class_=&amp;#39;rev&amp;#39;).text if article.find(class_=&amp;#39;rev&amp;#39;) else None&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L20&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;20&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC20&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    brand=article.find(&amp;#39;a&amp;#39;).get(&amp;#39;data-brand&amp;#39;)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L21&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;21&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC21&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    discount=article.find(class_=&amp;#39;bdg _dsct _sm&amp;#39;).text if article.find(class_=&amp;#39;bdg _dsct _sm&amp;#39;) else False&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L22&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;22&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC22&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    boutiqueOfficielle=True if article.find(class_=&amp;#39;bdg _mall _xs&amp;#39;) else False&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L23&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;23&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC23&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    etranger=True if article.find(class_=&amp;#39;bdg _glb _xs&amp;#39;) else False&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L24&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;24&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC24&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    fastDelivery=True if article.find(class_=&amp;#39;shipp&amp;#39;) else False&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L25&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;25&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC25&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    image=article.find(class_=&amp;#39;img&amp;#39;).get(&amp;#39;data-src&amp;#39;)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L26&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;26&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC26&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    return {&amp;#39;reviewcount&amp;#39;:reviewcount,&amp;#39;img_url&amp;#39;:image,&amp;#39;id&amp;#39;:dataid,&amp;#39;href&amp;#39;:href,&amp;#39;name&amp;#39;:name,&amp;#39;category&amp;#39;:category,&amp;#39;brand&amp;#39;:brand,&amp;#39;price&amp;#39;:price,&amp;#39;stars&amp;#39;:stars,&amp;#39;discount&amp;#39;:discount,&amp;#39;boutiqueOfficielle&amp;#39;:boutiqueOfficielle,&amp;#39;etranger&amp;#39;:etranger,&amp;#39;fastDelivery&amp;#39;:fastDelivery,&amp;#39;timestamp&amp;#39;:datetime.today().strftime(&amp;#39;%Y-%m-%d %H:%M:%S&amp;#39;)}&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L27&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;27&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC27&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L28&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;28&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC28&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;def process_page(url):&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L29&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;29&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC29&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    page = requests.get(url)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L30&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;30&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC30&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    soup = BeautifulSoup(page.text, &amp;#39;html.parser&amp;#39;)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L31&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;31&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC31&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    PageData=[process_article(article) for article in soup.find_all(name=&amp;#39;article&amp;#39;,attrs={&amp;#39;class&amp;#39;:&amp;#39;prd _fb col c-prd&amp;#39;})]&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L32&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;32&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC32&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    PagaDataTable=pd.DataFrame(PageData)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L33&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;33&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC33&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L34&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;34&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC34&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    PagaDataTable.to_csv(outfile, mode=&amp;#39;a&amp;#39;, index=False,header=False, encoding=&amp;quot;utf-8&amp;quot;)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L35&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;35&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC35&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L36&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;36&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC36&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;def process_sub_category(url):&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L37&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;37&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC37&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    page = requests.get(url)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L38&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;38&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC38&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    soup = BeautifulSoup(page.text, &amp;#39;html.parser&amp;#39;)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L39&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;39&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC39&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    try:&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L40&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;40&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC40&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;        numPagesToScrape=soup.find(name=&amp;#39;a&amp;#39;,attrs={&amp;#39;class&amp;#39;:&amp;#39;pg&amp;#39;,&amp;#39;aria-label&amp;#39;:&amp;#39;Dernière page&amp;#39;}).get(&amp;#39;href&amp;#39;).split(&amp;quot;page=&amp;quot;)[1].split(&amp;quot;#&amp;quot;)[0]&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L41&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;41&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC41&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    except (AttributeError, IndexError):&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L42&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;42&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC42&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;        numPagesToScrape=&amp;#39;1&amp;#39;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L43&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;43&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC43&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    urls=[url+&amp;#39;?page=&amp;#39;+str(i) for i in range(1,int(numPagesToScrape)+1)]&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L44&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;44&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC44&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    thread_map(process_page,urls,max_workers=32)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L45&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;45&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC45&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L46&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;46&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC46&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L47&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;47&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC47&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;def start_scrape():&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L48&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;48&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC48&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    subCategories=pd.read_csv(&amp;#39;/opt/airflow/dags/subCategoriesHrefs.csv&amp;#39;)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L49&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;49&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC49&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    subCategories.href=subCategories.href.apply(lambda s: str(s).split(&amp;quot;?shipped_from=country_local&amp;quot;)[0])&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L50&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;50&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC50&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    pd.DataFrame({&amp;#39;reviewcount&amp;#39;:[],&amp;#39;img_url&amp;#39;:[],&amp;#39;id&amp;#39;:[],&amp;#39;href&amp;#39;:[],&amp;#39;name&amp;#39;:[],&amp;#39;category&amp;#39;:[],&amp;#39;brand&amp;#39;:[],&amp;#39;price&amp;#39;:[],&amp;#39;stars&amp;#39;:[],&amp;#39;discount&amp;#39;:[],&amp;#39;boutiqueOfficielle&amp;#39;:[],&amp;#39;etranger&amp;#39;:[],&amp;#39;fastDelivery&amp;#39;:[],&amp;#39;timestamp&amp;#39;:[]}).to_csv(outfile,index=False)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L51&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;51&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC51&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    for subCategory in tqdm(subCategories.href):&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L52&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;52&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC52&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;            process_sub_category(subCategory)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L53&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;53&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC53&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L54&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;54&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC54&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L55&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;55&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC55&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;if __name__ == &amp;quot;__main__&amp;quot;:&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-scrapejumia-py-L56&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;56&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-scrapejumia-py-LC56&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    start_scrape()&lt;/td&gt;
        &lt;/tr&gt;
  &lt;/table&gt;
&lt;/div&gt;


    &lt;/div&gt;

  &lt;/div&gt;

&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/tariqmassaoudi/5152eae7e2b8ba384e9a0279e5b2b43e/raw/6b04c88a0ad389e2406c668da7ebadf6ff469bba/scrapeJumia.py&quot; style=&quot;float:right&quot; class=&quot;Link--inTextBlock&quot;&gt;view raw&lt;/a&gt;
        &lt;a href=&quot;https://gist.github.com/tariqmassaoudi/5152eae7e2b8ba384e9a0279e5b2b43e#file-scrapejumia-py&quot; class=&quot;Link--inTextBlock&quot;&gt;
          scrapeJumia.py
        &lt;/a&gt;
        hosted with &amp;#10084; by &lt;a class=&quot;Link--inTextBlock&quot; href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/p&gt;
&lt;h3&gt;Airflow: A Powerful Task Scheduling Platform:&lt;/h3&gt;
&lt;p&gt;Airflow is a platform for creating and running workflows as Directed Acyclic Graphs (DAGs) of tasks, taking the dependencies and data flows between them into account. You specify the order of execution and the retry policy, and describe what each task does: fetching data, running analysis, triggering other systems, and so on.&lt;/p&gt;
&lt;p&gt;One of the biggest advantages of Airflow is its graphical interface, which lets you track the progress of your tasks in real time. It also provides built-in retry on failure and integrations with most popular databases. Moreover, it stores execution times and logs, which is very useful for debugging.&lt;/p&gt;
&lt;p&gt;To learn more about Airflow, check out the &lt;a href=&quot;https://airflow.apache.org/docs/&quot;&gt;official documentation&lt;/a&gt;, which is the best place to get started.&lt;/p&gt;
&lt;p&gt;Below is the DAG used in the project, along with the main Python code used to generate it:&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 1000px; &quot;
    &gt;
      &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 27.599999999999998%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAGCAYAAADDl76dAAAACXBIWXMAAA7DAAAOwwHHb6hkAAAA9UlEQVQY012Qi26DMAxF+f//2wYDRni3pCF2QnmoHXdKUia2SNY9vk4i2xFOZ993r8/HNyzNWAcJYwhyk7CzxmYmTMSQ9ga1zFDzHcuq8FwltumKx6IRWWthjPkNa4MSEZi0V836lZNXcj4HZibI4YpL3+EmJSLSBGLjg09q2IKNBbMNzCc2f3OtGeNIUGpEJNWAvsvRNxnKNkXf5qibFHWb4dJk6Nrc+10TaoEzz93xxt3rC99xJJVEVSUoi3cUdYKyjCGqGKIOnsuL5hPCsXCceK5EDCE+fK34ekPVpNDuQ7+T16iu/WPkY/yz/59DflqXJvwAblbECbCIhHEAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;airflow&quot;
        title=&quot;airflow&quot;
        src=&quot;/static/a08dcb4fc26cbd68f02f1dcb9392e629/00d43/airflow.png&quot;
        srcset=&quot;/static/a08dcb4fc26cbd68f02f1dcb9392e629/63868/airflow.png 250w,
/static/a08dcb4fc26cbd68f02f1dcb9392e629/0b533/airflow.png 500w,
/static/a08dcb4fc26cbd68f02f1dcb9392e629/00d43/airflow.png 1000w,
/static/a08dcb4fc26cbd68f02f1dcb9392e629/33c15/airflow.png 1463w&quot;
        sizes=&quot;(max-width: 1000px) 100vw, 1000px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
    &lt;/span&gt;
&lt;div id=&quot;gist118836196&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot; translate=&quot;no&quot; data-color-mode=&quot;light&quot; data-light-theme=&quot;light&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        
&lt;div class=&quot;js-gist-file-update-container js-task-list-container&quot;&gt;
      &lt;div id=&quot;file-jumiadag-py&quot; class=&quot;file my-2&quot;&gt;
    
    &lt;div itemprop=&quot;text&quot;
      class=&quot;Box-body p-0 blob-wrapper data type-python  &quot;
      style=&quot;overflow: auto&quot; tabindex=&quot;0&quot; role=&quot;region&quot;
      aria-label=&quot;jumiaDag.py content, created by tariqmassaoudi on 02:23PM on October 13, 2022.&quot;
    &gt;

        
&lt;div class=&quot;js-check-hidden-unicode js-blob-code-container blob-code-content&quot;&gt;


  &lt;table data-hpc class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;4&quot; data-paste-markdown-skip data-tagsearch-path=&quot;jumiaDag.py&quot;&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L1&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;import airflow&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L2&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;from airflow import DAG&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L3&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;3&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC3&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;from airflow.operators.python_operator import PythonOperator&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L4&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;4&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC4&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;from datetime import timedelta&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L5&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;5&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC5&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;import sys, os&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L6&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;6&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC6&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L7&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;7&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC7&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;sys.path.insert(1, &amp;#39;/opt/airflow/dags/scripts&amp;#39;)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L8&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;8&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC8&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;from scrapeJumia import *&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L9&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;9&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC9&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;from updateProducts import *&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L10&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;10&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC10&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;from updatePrices import *&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L11&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;11&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC11&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;from updateProdRanking import *&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L12&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;12&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC12&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;from updateKpi import *&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L13&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;13&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC13&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;from uploadS3 import *&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L14&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;14&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC14&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L15&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;15&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC15&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;default_args = {&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L16&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;16&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC16&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    &amp;#39;owner&amp;#39;: &amp;#39;airflow&amp;#39;,&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L17&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;17&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC17&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    &amp;#39;depends_on_past&amp;#39;: False,&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L18&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;18&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC18&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    &amp;#39;start_date&amp;#39;: airflow.utils.dates.days_ago(2),&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L19&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;19&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC19&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    &amp;#39;email&amp;#39;: [&amp;#39;youremail@gmail.com&amp;#39;],&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L20&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;20&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC20&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    &amp;#39;email_on_failure&amp;#39;: True,&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L21&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;21&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC21&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    &amp;#39;email_on_retry&amp;#39;: False,&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L22&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;22&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC22&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    &amp;#39;retries&amp;#39;: 1,&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L23&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;23&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC23&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    &amp;#39;retry_delay&amp;#39;: timedelta(minutes=1)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L24&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;24&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC24&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    &lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L25&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;25&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC25&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;}&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L26&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;26&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC26&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;dag_python = DAG(&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L27&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;27&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC27&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    dag_id=&amp;quot;jumia_python&amp;quot;,&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L28&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;28&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC28&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    default_args=default_args,&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L29&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;29&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC29&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    description=&amp;#39;DAG that scrapes Jumia and updates a Postgres database in RDS&amp;#39;,&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L30&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;30&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC30&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    schedule_interval=&amp;#39;10 0 * * *&amp;#39;, &lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L31&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;31&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC31&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    catchup=False&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L32&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;32&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC32&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    )&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L33&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;33&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC33&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;scrape_jumia = PythonOperator(task_id=&amp;#39;scrape_jumia&amp;#39;, python_callable=start_scrape, dag=dag_python)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L34&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;34&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC34&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;update_products = PythonOperator(task_id=&amp;#39;update_products&amp;#39;, python_callable=start_update_products, dag=dag_python)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L35&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;35&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC35&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;update_prices = PythonOperator(task_id=&amp;#39;update_prices&amp;#39;, python_callable=start_update_prices, dag=dag_python)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L36&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;36&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC36&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;update_prod_ranking = PythonOperator(task_id=&amp;#39;update_prod_ranking&amp;#39;, python_callable=start_update_prod_ranking, dag=dag_python)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L37&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;37&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC37&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;update_kpi = PythonOperator(task_id=&amp;#39;update_kpi&amp;#39;, python_callable=start_update_kpi, dag=dag_python)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L38&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;38&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC38&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;upload_s3 = PythonOperator(task_id=&amp;#39;upload_s3&amp;#39;, python_callable=start_upload_s3, dag=dag_python)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L39&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;39&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC39&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;scrape_jumia &amp;gt;&amp;gt; upload_s3 &amp;gt;&amp;gt; update_products &amp;gt;&amp;gt; update_prices &amp;gt;&amp;gt; update_prod_ranking &amp;gt;&amp;gt; update_kpi&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L40&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;40&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC40&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L41&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;41&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC41&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L42&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;42&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC42&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-jumiadag-py-L43&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;43&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-jumiadag-py-LC43&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
  &lt;/table&gt;
&lt;/div&gt;


    &lt;/div&gt;

  &lt;/div&gt;

&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/tariqmassaoudi/eb5d1310e21b9a0d1501067fe702f4d3/raw/d1e1f1aa4ee397afe00403fc5219731b72bfb2d7/jumiaDag.py&quot; style=&quot;float:right&quot; class=&quot;Link--inTextBlock&quot;&gt;view raw&lt;/a&gt;
        &lt;a href=&quot;https://gist.github.com/tariqmassaoudi/eb5d1310e21b9a0d1501067fe702f4d3#file-jumiadag-py&quot; class=&quot;Link--inTextBlock&quot;&gt;
          jumiaDag.py
        &lt;/a&gt;
        hosted with &amp;#10084; by &lt;a class=&quot;Link--inTextBlock&quot; href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/p&gt;
&lt;h3&gt;AWS Lambda: A Serverless Backend Solution&lt;/h3&gt;
&lt;p&gt;Lambda functions are flexible and suit a wide range of applications. In this project, they serve as a REST API that offloads work from the main EC2 server. Getting started with Lambda is easy: choose your preferred language, then start a function from scratch, from a container image, or from one of the provided AWS blueprints.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 1000px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/0e7ffb25cabdb079d7fd242093c6309e/4969b/lambdacreate.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 9.2%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAACCAYAAABYBvyLAAAACXBIWXMAAA7DAAAOwwHHb6hkAAAAZ0lEQVQI11WMwQ6DMAxD+f+P3GVM2iS60CZxoCB5IuMAB0v2k+3h8XzxXWZ+m1EMFHU2VZr7TQdTs/QOEEDmg1/7w1gaPxWcvLNg46SgiOQAEQSCsSz/Q1VGBNfeue17Hs+1pvzs/wCPZJnxbErsOQAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;lambdacreate&quot;
        title=&quot;lambdacreate&quot;
        src=&quot;/static/0e7ffb25cabdb079d7fd242093c6309e/00d43/lambdacreate.png&quot;
        srcset=&quot;/static/0e7ffb25cabdb079d7fd242093c6309e/63868/lambdacreate.png 250w,
/static/0e7ffb25cabdb079d7fd242093c6309e/0b533/lambdacreate.png 500w,
/static/0e7ffb25cabdb079d7fd242093c6309e/00d43/lambdacreate.png 1000w,
/static/0e7ffb25cabdb079d7fd242093c6309e/aa440/lambdacreate.png 1500w,
/static/0e7ffb25cabdb079d7fd242093c6309e/e8950/lambdacreate.png 2000w,
/static/0e7ffb25cabdb079d7fd242093c6309e/4969b/lambdacreate.png 2454w&quot;
        sizes=&quot;(max-width: 1000px) 100vw, 1000px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Once you’ve created your function, you’ll need to configure it for your use case. In my experience, this includes setting up “layers,” which let your function use external libraries such as pandas and sqlalchemy. You’ll also need to put a REST API in front of the function so it can be called from the web, and enable CORS (Cross-Origin Resource Sharing) so that browsers allow calls from your site. The AWS &lt;a href=&quot;https://docs.aws.amazon.com/apigateway/latest/developerguide/how-to-cors.html&quot;&gt;documentation&lt;/a&gt; does an excellent job of explaining this.&lt;/p&gt;
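&lt;p&gt;As a rough illustration of the CORS part: when API Gateway forwards requests through a Lambda proxy integration, the function’s own response has to carry the CORS headers. The sketch below assumes that setup. The origin, the allowed methods, and the &lt;code&gt;build_response&lt;/code&gt; helper are placeholders, not values from this project.&lt;/p&gt;

```python
import json

# Hypothetical origin of the front end; replace it with your own site.
ALLOWED_ORIGIN = "https://example.com"


def build_response(status_code, payload, origin=ALLOWED_ORIGIN):
    """Wrap a payload in the response shape used by API Gateway proxy integrations."""
    return {
        "statusCode": status_code,
        "headers": {
            "Content-Type": "application/json",
            # CORS headers that the browser checks before exposing the response.
            "Access-Control-Allow-Origin": origin,
            "Access-Control-Allow-Headers": "Content-Type",
            "Access-Control-Allow-Methods": "OPTIONS,POST",
        },
        "body": json.dumps(payload),
    }


def lambda_handler(event, context):
    # The request body arrives as a JSON string, or is missing when empty.
    body = json.loads(event.get("body") or "{}")
    return build_response(200, {"received": body})
```

&lt;p&gt;Keeping the headers in one helper means every code path, including error responses, returns them. Otherwise the browser may hide the real error behind a CORS failure.&lt;/p&gt;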
&lt;p&gt;After setting up your Lambda function, you’ll have a function with layers and an API gateway:&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 1000px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/abc97ddbbb31fafa99358139b4c56614/b6e34/lambdaoverview.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 30.4%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAGCAYAAADDl76dAAAACXBIWXMAAA7DAAAOwwHHb6hkAAAA2UlEQVQY022RS24EIQxEuf9xss4RImWT/aySUX9p/tg0XRHOMGmNevEEsqpsUygiwitcKsjOSD9fcCHCOQdjLZgZ+77jytNRzCTCDlEGVyAON6yfb1ithzEG22ZQaxVk6MlXShHaXVFm5JSRc34IGYUZIREmE2Gtla32WsXkfUDO9BhOCCHAOQ/vvfRQH+8Lvm8aq15gjMVxHM9NcFRpllIScWv0jIUZmQhabxinCfOyIMYENQ4a9/uAcZ6l0ER/5n/OkfTNzmd/smTYMosxCl1wxeXnvdRbhr+Bz9KvyF8oeQAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;lambdaoverview&quot;
        title=&quot;lambdaoverview&quot;
        src=&quot;/static/abc97ddbbb31fafa99358139b4c56614/00d43/lambdaoverview.png&quot;
        srcset=&quot;/static/abc97ddbbb31fafa99358139b4c56614/63868/lambdaoverview.png 250w,
/static/abc97ddbbb31fafa99358139b4c56614/0b533/lambdaoverview.png 500w,
/static/abc97ddbbb31fafa99358139b4c56614/00d43/lambdaoverview.png 1000w,
/static/abc97ddbbb31fafa99358139b4c56614/b6e34/lambdaoverview.png 1103w&quot;
        sizes=&quot;(max-width: 1000px) 100vw, 1000px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;To let your function communicate with your RDS database, connect it to the VPC and subnets that your RDS setup uses. Then create a “security group” that allows connections on the Postgres port (5432) and assign it to the function:&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 1000px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/fb15ca3779e21dc387e99f08d102d553/b5a09/lambdavpc.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 42%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAICAYAAAD5nd/tAAAACXBIWXMAAA7DAAAOwwHHb6hkAAAA+0lEQVQoz3VRiU7FMAzr/38oEqA3bT3SJD2M3DF4DKhkNakT52jYY0IUhZpDVVFrRe8drTdIVag3qDWY2YK7o1ZFLBXVTptvrTVUVQTzBmsD5n0RTJpzLrw9drw+DqTqKCLIpayizIliEOsw89UAT+sd4XLuZwluB17edxxiSCkjxoRSBP8ddhmaO6Yrej/HGmN8BbDYXwXXBE/2D0EpBXnfIFLWWBQlceIUpO2faHdcvDvUDEFqRcoFvK8dPoNF+FGXzx3KZ+G7zbgluG0b9uNY1ThyJ/rAmHN9BEVpk2M3FOGobIC7ZTw5+iGmtJJSzqB9B9/v3Lf/m/sAx11x80zuUPUAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;lambdavpc&quot;
        title=&quot;lambdavpc&quot;
        src=&quot;/static/fb15ca3779e21dc387e99f08d102d553/00d43/lambdavpc.png&quot;
        srcset=&quot;/static/fb15ca3779e21dc387e99f08d102d553/63868/lambdavpc.png 250w,
/static/fb15ca3779e21dc387e99f08d102d553/0b533/lambdavpc.png 500w,
/static/fb15ca3779e21dc387e99f08d102d553/00d43/lambdavpc.png 1000w,
/static/fb15ca3779e21dc387e99f08d102d553/b5a09/lambdavpc.png 1360w&quot;
        sizes=&quot;(max-width: 1000px) 100vw, 1000px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
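&lt;p&gt;The same settings can also be expressed in code. The sketch below only builds the request parameters and makes no AWS calls. It assumes one common arrangement, where the database’s security group accepts inbound traffic on port 5432 from the function’s security group. All IDs and the function name are hypothetical.&lt;/p&gt;

```python
POSTGRES_PORT = 5432


def postgres_ingress_rule(source_group_id):
    """Inbound rule allowing TCP 5432 from members of another security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": POSTGRES_PORT,
        "ToPort": POSTGRES_PORT,
        "UserIdGroupPairs": [{"GroupId": source_group_id}],
    }


def lambda_vpc_config(subnet_ids, security_group_ids):
    """VpcConfig value that attaches a function to subnets and security groups."""
    return {
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
    }


# Hypothetical IDs, for illustration only.
rule = postgres_ingress_rule("sg-lambda-example")
vpc_config = lambda_vpc_config(
    ["subnet-a-example", "subnet-b-example"], ["sg-lambda-example"]
)

# With boto3 installed and credentials configured, these would be passed as:
#   boto3.client("ec2").authorize_security_group_ingress(
#       GroupId="sg-database-example", IpPermissions=[rule])
#   boto3.client("lambda").update_function_configuration(
#       FunctionName="getProduct", VpcConfig=vpc_config)
```

&lt;p&gt;Referencing a security group instead of an IP range keeps the rule valid even when the function’s network interfaces change addresses.&lt;/p&gt;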
&lt;p&gt;Here’s an example of a function that gets product details given a product ID or URL:&lt;/p&gt;
&lt;p&gt;&lt;div id=&quot;gist118837143&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot; translate=&quot;no&quot; data-color-mode=&quot;light&quot; data-light-theme=&quot;light&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        
&lt;div class=&quot;js-gist-file-update-container js-task-list-container&quot;&gt;
      &lt;div id=&quot;file-getproduct-py&quot; class=&quot;file my-2&quot;&gt;
    
    &lt;div itemprop=&quot;text&quot;
      class=&quot;Box-body p-0 blob-wrapper data type-python  &quot;
      style=&quot;overflow: auto&quot; tabindex=&quot;0&quot; role=&quot;region&quot;
      aria-label=&quot;getProduct.py content, created by tariqmassaoudi on 03:15PM on October 13, 2022.&quot;
    &gt;

        
&lt;div class=&quot;js-check-hidden-unicode js-blob-code-container blob-code-content&quot;&gt;


  &lt;table data-hpc class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;4&quot; data-paste-markdown-skip data-tagsearch-path=&quot;getProduct.py&quot;&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L1&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;import json&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L2&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;from sqlalchemy import create_engine&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L3&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;3&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC3&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;import pandas as pd&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L4&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;4&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC4&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;def lambda_handler(event, context):&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L5&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;5&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC5&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    engine = create_engine(&amp;#39;postgresql://postgres:password!@host:5432/database&amp;#39;)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L6&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;6&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC6&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    &lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L7&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;7&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC7&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;  &lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L8&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;8&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC8&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    try:&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L9&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;9&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC9&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;        product_id=json.loads(event[&amp;#39;body&amp;#39;])[&amp;quot;prod_id&amp;quot;]&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L10&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;10&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC10&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;        sql=&amp;quot;SELECT * FROM products where id=&amp;#39;&amp;quot;+product_id+&amp;quot;&amp;#39;&amp;quot;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L11&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;11&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC11&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    except:&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L12&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;12&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC12&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    &lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L13&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;13&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC13&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;        href_full=json.loads(event[&amp;#39;body&amp;#39;])[&amp;quot;href&amp;quot;]&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L14&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;14&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC14&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;        if href_full.find(&amp;quot;www&amp;quot;)&amp;gt;0:&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L15&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;15&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC15&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;            href=json.loads(event[&amp;#39;body&amp;#39;])[&amp;quot;href&amp;quot;].split(&amp;quot;https://www.jumia.ma&amp;quot;)[1]&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L16&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;16&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC16&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;            if href.find(&amp;quot;?&amp;quot;)&amp;gt;0:&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L17&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;17&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC17&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;                href=href.split(&amp;quot;?&amp;quot;)[0]&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L18&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;18&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC18&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;        else:&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L19&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;19&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC19&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;            href=json.loads(event[&amp;#39;body&amp;#39;])[&amp;quot;href&amp;quot;].split(&amp;quot;https://jumia.ma&amp;quot;)[1]&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L20&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;20&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC20&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;            if href.find(&amp;quot;?&amp;quot;)&amp;gt;0:&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L21&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;21&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC21&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;                href=href.split(&amp;quot;?&amp;quot;)[0]&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L22&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;22&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC22&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;        sql=&amp;quot;SELECT * FROM products where href=&amp;#39;&amp;quot;+href+&amp;quot;&amp;#39;&amp;quot;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L23&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;23&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC23&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    product=pd.read_sql(sql,con=engine)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L24&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;24&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC24&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;  &lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L25&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;25&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC25&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    return {&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L26&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;26&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC26&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;        &amp;#39;headers&amp;#39;: {&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L27&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;27&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC27&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;            &amp;#39;Content-Type&amp;#39;: &amp;#39;application/json&amp;#39;,&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L28&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;28&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC28&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;            &amp;#39;Access-Control-Allow-Origin&amp;#39;: &amp;#39;*&amp;#39;,&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L29&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;29&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC29&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;            &amp;#39;Access-Control-Allow-Headers&amp;#39;: &amp;#39;Authorization,Content-Type&amp;#39;,&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L30&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;30&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC30&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;            &amp;#39;Access-Control-Allow-Methods&amp;#39;: &amp;#39;GET,POST,OPTIONS&amp;#39;,&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L31&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;31&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC31&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    },&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L32&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;32&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC32&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;        &amp;#39;statusCode&amp;#39;: 200,&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L33&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;33&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC33&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;        &amp;#39;body&amp;#39;: json.dumps(product.to_dict(orient=&amp;#39;records&amp;#39;)[0])&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L34&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;34&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC34&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;            }&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td id=&quot;file-getproduct-py-L35&quot; class=&quot;blob-num js-line-number js-blob-rnum&quot; data-line-number=&quot;35&quot;&gt;&lt;/td&gt;
          &lt;td id=&quot;file-getproduct-py-LC35&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
        &lt;/tr&gt;
  &lt;/table&gt;
&lt;/div&gt;


    &lt;/div&gt;

  &lt;/div&gt;

&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/tariqmassaoudi/0c5c7a75923a9124f329ea49c46c2b46/raw/f9d382dacfc874ff7eb9f471fd55e9af858840b2/getProduct.py&quot; style=&quot;float:right&quot; class=&quot;Link--inTextBlock&quot;&gt;view raw&lt;/a&gt;
        &lt;a href=&quot;https://gist.github.com/tariqmassaoudi/0c5c7a75923a9124f329ea49c46c2b46#file-getproduct-py&quot; class=&quot;Link--inTextBlock&quot;&gt;
          getProduct.py
        &lt;/a&gt;
        hosted with &amp;#10084; by &lt;a class=&quot;Link--inTextBlock&quot; href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/p&gt;
&lt;h3&gt;Conclusion:&lt;/h3&gt;
&lt;p&gt;It was an exciting and fulfilling experience working on this project, as it has real-world applications for the average person. AWS’s free tier offers a generous package, making it ideal for prototyping compared to the competition. As long as you use it efficiently and do not exceed the limits, you can host almost any project.&lt;/p&gt;
&lt;p&gt;Thank you for reading this article. We hope you found it informative and learned something new. If you have any questions or would like to discuss further, feel free to reach out on LinkedIn: &lt;a href=&quot;https://www.linkedin.com/in/tariqmassaoudi/&quot;&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Every Data Scientist Needs To Learn This]]></title><description><![CDATA[Photo by Rock’n Roll Monkey on Unsplash Ever had the idea of this amazing data science project, you look up the data you’ll need online but…]]></description><link>https://www.tariqmassaoudi.com/every-data-scientist-needs-to-learn-this/</link><guid isPermaLink="false">https://www.tariqmassaoudi.com/every-data-scientist-needs-to-learn-this/</guid><pubDate>Sun, 25 Oct 2020 22:40:32 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/0*4Zw7_m7VVmtjAAhT&quot; alt=&quot;&quot;&gt;Photo by &lt;a href=&quot;https://unsplash.com/@rocknrollmonkey?utm_source=medium&amp;#x26;utm_medium=referral&quot;&gt;Rock’n Roll Monkey&lt;/a&gt; on &lt;a href=&quot;https://unsplash.com?utm_source=medium&amp;#x26;utm_medium=referral&quot;&gt;Unsplash&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Ever had an idea for an amazing data science project, looked online for the data you need, only to discover it’s nowhere to be found? Unfortunately, not every dataset you’ll ever need is online. So, what should you do? Abandon your idea and go back to Kaggle? No! A real data scientist should be able to collect their own data!&lt;/p&gt;
&lt;h1&gt;What’s Web Scraping and why learn it?&lt;/h1&gt;
&lt;p&gt;The web is the single biggest source of data: it’s effectively an archive of human knowledge, at least for the last 20 years. Web scraping is the art of extracting that data from the web. For a data scientist it’s a handy tool that opens the door to many cool projects.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note that some websites prohibit scraping and might ban your IP address if you scrape too frequently or maliciously.&lt;/strong&gt;&lt;/p&gt;
&lt;h1&gt;&lt;strong&gt;How do we scrape?&lt;/strong&gt;&lt;/h1&gt;
&lt;p&gt;There are two approaches when it comes to web scraping.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Request based scraping&lt;/strong&gt;: With this approach we send a request to the website’s server, which returns the HTML of the page. This is the same content you see when you click “View page source” in Google Chrome; you can try it right now by pressing &lt;strong&gt;ctrl+u&lt;/strong&gt;. We then typically use a library to parse the HTML and extract the data we want. This approach is simple, lightweight and very fast. It has one major drawback, though: most modern websites use JavaScript to render their content, i.e. you don’t see the content of the page until after the JavaScript executes, which the request method can’t handle.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Browser based scraping&lt;/strong&gt;: To execute JavaScript we need a fully-fledged browser, and that is what this method is about. We drive a real browser, navigate to the page we want and wait for the JavaScript to execute; we can even interact with the page by clicking buttons or filling in forms. Then we read the resulting HTML and extract the data. This approach is very flexible and lets you scrape pretty much any website, but it’s much slower and more resource intensive than just sending a request.&lt;/p&gt;
&lt;h1&gt;Scrape anything with selenium:&lt;/h1&gt;
&lt;p&gt;Selenium is a widely used library for web automation, but you can use it for scraping too! Basically, any task a human can do manually in a browser can be simulated with Selenium. You can create a bot that performs a certain action when something happens, or you can make Selenium browse web pages and scrape data for you, which is what we’ll be doing in this article.&lt;/p&gt;
&lt;p&gt;To parse the HTML we will be using Beautiful Soup.&lt;/p&gt;
&lt;p&gt;For further reading here are documentation links for &lt;a href=&quot;https://selenium-python.readthedocs.io/&quot;&gt;selenium&lt;/a&gt; and &lt;a href=&quot;https://www.crummy.com/software/BeautifulSoup/bs4/doc/&quot;&gt;beautiful soup&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;Demo: Scraping Indeed Jobs&lt;/h1&gt;
&lt;p&gt;Let’s get some practice. The goal of this demo is to scrape jobs from Indeed for a given search query and save them in a CSV file.&lt;/p&gt;
&lt;p&gt;More precisely we are interested in:&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Job title&lt;/li&gt;
&lt;li&gt;Location of the job&lt;/li&gt;
&lt;li&gt;Company that posted the offer&lt;/li&gt;
&lt;li&gt;Job description&lt;/li&gt;
&lt;li&gt;When the job was posted&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here’s a link to a sample &lt;a href=&quot;https://ma.indeed.com/viewjob?jk=8fb003e7a434a0c5&amp;#x26;tk=1elgb9sfbstbv800&amp;#x26;from=serp&amp;#x26;vjs=3&quot;&gt;job page&lt;/a&gt; and here’s the &lt;a href=&quot;https://github.com/tariqmassaoudi/IndeedScraping&quot;&gt;project code&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;First let’s import the required libraries:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;from bs4 import BeautifulSoup  
from webdriver_manager.chrome import ChromeDriverManager  
import pandas as pd  
from selenium import webdriver  
from selenium.webdriver.chrome.options import Options
chrome_options = Options()  
chrome_options.add_argument(&quot;--headless&quot;)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Beautiful Soup is for interacting with HTML&lt;/li&gt;
&lt;li&gt;Pandas to export to csv&lt;/li&gt;
&lt;li&gt;The web driver is the actual browser. We will be using Chrome and configuring it to run in &lt;strong&gt;headless mode&lt;/strong&gt;, which means it runs in the background and we won’t see a browser window going through the job pages. This is optional: if you want to see the browser, remove the headless argument!&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The first thing to do is to get the actual job pages. Luckily, Indeed has a search function; all you have to do is navigate to:&lt;/p&gt;
&lt;p&gt;“&lt;a href=&quot;https://ma.indeed.com/jobs?q=data+scientist&amp;#x26;start=10&quot;&gt;https://ma.indeed.com/jobs?q=data+scientist&amp;#x26;start=10&lt;/a&gt;”&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*YbDv1RXV1AZGmKX2Xp7m_g.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;You’ll get the second page of jobs related to data science. You can change the search query through the &lt;strong&gt;q&lt;/strong&gt; argument and the page through the &lt;strong&gt;start&lt;/strong&gt; argument, which is the offset of the first result (so 10 gives the second page). Note that I’m using the Moroccan portal of Indeed, but this will work for any country.&lt;/p&gt;
&lt;p&gt;We will implement two functions: a helper that navigates to a URL, extracts the HTML and turns it into a Beautiful Soup object we can interact with, and another that extracts the links to the job pages:&lt;/p&gt;
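&lt;p&gt;The URL scheme above can be sketched as a small helper. This is only a minimal sketch and not part of the project code: the name build_search_url is hypothetical, and the page size of 10 is an assumption inferred from start=10 returning the second page.&lt;/p&gt;

```python
from urllib.parse import urlencode


# Hypothetical helper (not from the original project): builds a search URL
# from a query and a 1-indexed page number.
# Assumption: each results page holds 10 jobs, so page 2 starts at offset 10.
def build_search_url(query, page, base="https://ma.indeed.com/jobs", per_page=10):
    offset = (max(page, 1) - 1) * per_page
    # urlencode escapes the query for us, e.g. spaces become "+".
    return base + "?" + urlencode({"q": query, "start": offset})


if __name__ == "__main__":
    print(build_search_url("data scientist", 2))
```

&lt;p&gt;Letting urlencode build the query string avoids hand-escaping spaces or special characters in the search terms.&lt;/p&gt;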
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;def toSoup(url):
    # Load the page in the browser and parse the rendered HTML
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, &apos;lxml&apos;)
    return soup

def getPageUrls(query,number):
    # Each results page holds 10 jobs, so page N starts at offset (N-1)*10
    url=&quot;https://ma.indeed.com/emplois?q=&quot;+str(query)+&quot;&amp;amp;start=&quot;+str(((number-1)*10))
    soup=toSoup(url)
    maxPages=soup.find(&quot;div&quot;,{&quot;id&quot;:&quot;searchCountPages&quot;}).text.strip().split(&quot; &quot;)[3]
    return maxPages,[appendIndeedUrl(a[&quot;href&quot;]) for a in soup.findAll(&quot;a&quot;,{&quot;class&quot;:&quot;jobtitle turnstileLink&quot;})]&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now that we have the URLs let’s implement some functions to extract what we want out of the job page:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;def paragraphArrayToSingleString(paragraphs):
    # Join the text of every element, one per line
    string=&quot;&quot;
    for paragraph in paragraphs:
        string=string+&quot;\n&quot;+paragraph.text.strip()
    return string

def appendIndeedUrl(url):
    # Job links are relative, so prefix them with the site root
    return &quot;https://ma.indeed.com&quot;+str(url)

def processPage(url):
    soup=toSoup(url)
    title=soup.find(&quot;h1&quot;,{&quot;class&quot;:&quot;icl-u-xs-mb--xs icl-u-xs-mt--none jobsearch-JobInfoHeader-title&quot;}).text.strip()
    CompanyAndLocation=soup.find(&quot;div&quot;,{&quot;class&quot;:&quot;jobsearch-InlineCompanyRating icl-u-xs-mt--xs jobsearch-DesktopStickyContainer-companyrating&quot;})
    length=len(CompanyAndLocation)
    # Three children means the company name is present alongside the location
    if length==3:
        company=CompanyAndLocation.findAll(&quot;div&quot;)[0].text.strip()
        location=CompanyAndLocation.findAll(&quot;div&quot;)[2].text.strip()
    else:
        company=&quot;NAN&quot;
        location=CompanyAndLocation.findAll(&quot;div&quot;)[0].text.strip()
    date=soup.find(&quot;div&quot;,{&quot;class&quot;:&quot;jobsearch-JobMetadataFooter&quot;}).text.split(&quot;-&quot;)[1].strip()
    description=paragraphArrayToSingleString(soup.find(&quot;div&quot;,{&quot;id&quot;:&quot;jobDescriptionText&quot;}).findAll())
    return {&quot;title&quot;:title,&quot;company&quot;:company,&quot;location&quot;:location,&quot;date&quot;:date,&quot;description&quot;:description}

def getMaxPages(query):
    url=&quot;https://ma.indeed.com/emplois?q=&quot;+str(query)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here we are using HTML attributes such as “class” or “id” to locate the information we want. You can figure out how to select the data you need by inspecting the page.&lt;/p&gt;
&lt;p&gt;Here’s an example for the title property:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*fK-8qgIao3G_IPZhEhk6Pg.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;We can see that the title is an “h1” that we can select using its class.&lt;/p&gt;
&lt;p&gt;Finally, let’s implement a function to get all the jobs and save them in a CSV file.&lt;/p&gt;
&lt;p&gt;Note that we are getting the max pages number so that the crawler stops when we have reached the final page.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;def getJobsForQuery(query):
    data=[]
    # 999 is only an upper bound; the real page count comes from the site
    for number in range(999):
        maxPages,urls=getPageUrls(query,number+1)
        for url in urls:
            try:
                page=processPage(url)
                data.append(page)
            except Exception:
                # Skip job pages that fail to parse
                pass
        print(&quot;finished Page number: &quot;+str(number+1))
        # Stop once the last page reported by the site has been processed
        if str(maxPages).isdigit() and number+1&amp;gt;=int(maxPages):
            break
    #Save the data to a csv file
    pd.DataFrame(data).to_csv(&quot;jobs_&quot;+query+&quot;.csv&quot;)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
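&lt;p&gt;The stopping rule can also be sketched in isolation. This is a minimal sketch under stated assumptions, not the project’s code: crawl_all_pages and fetch_page are hypothetical names, and fetch_page is assumed to return the last page number together with the items of the requested page. A while loop is used because its condition is re-checked on every iteration, whereas a range is fixed once it has been created.&lt;/p&gt;

```python
# Hypothetical sketch: fetch_page stands in for a function like getPageUrls and
# is assumed to return (last_page, items) for a 1-indexed page number.
def crawl_all_pages(fetch_page, hard_limit=999):
    items = []
    page = 1
    last_page = hard_limit  # unknown until the first page has been fetched
    while last_page >= page:
        reported_last, page_items = fetch_page(page)
        # Re-read the limit on every page; the loop condition sees the update.
        last_page = min(int(reported_last), hard_limit)
        items.extend(page_items)
        page += 1
    return items


if __name__ == "__main__":
    # Fake three-page site used only for demonstration.
    fake_site = {1: ["a", "b"], 2: ["c"], 3: ["d"]}
    print(crawl_all_pages(lambda n: ("3", fake_site[n])))
```

&lt;p&gt;The hard limit is kept as a safety net in case the site reports an unexpectedly large page count.&lt;/p&gt;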
&lt;p&gt;Now let’s scrape Data Science Jobs:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;driver = webdriver.Chrome(ChromeDriverManager().install(),options=chrome_options)  
getJobsForQuery(&quot;data scientist&quot;)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here’s the result:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*VVRsIA6zuTDgtgdYGeuT_g.png&quot; alt=&quot;&quot;&gt;A Sample of scraped jobs&lt;/p&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;In this article we learned about web scraping, why it’s important for every aspiring data scientist and the different approaches to it, and we applied that to scrape jobs from Indeed.&lt;/p&gt;
&lt;p&gt;If you made it this far, congratulations! Thanks for reading; I hope you’ve enjoyed the article. For personal contact or discussion, feel free to reach out to me on &lt;a href=&quot;https://www.linkedin.com/in/tariqmassaoudi/&quot;&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Arabic Topic Classification On The Hespress News Dataset]]></title><description><![CDATA[How to classify Arabic Text the right way Photo by Markus Winkler on Unsplash This article is the first…]]></description><link>https://www.tariqmassaoudi.com/arabic-topic-classification-on-the-hespress-news-dataset-7adceef12bed/</link><guid isPermaLink="false">https://www.tariqmassaoudi.com/arabic-topic-classification-on-the-hespress-news-dataset-7adceef12bed/</guid><pubDate>Sun, 18 Oct 2020 23:46:37 GMT</pubDate><content:encoded>&lt;p&gt;How to classify Arabic Text the right way&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/7998/0*-MNlm46aQfTjdCpp&quot; alt=&quot;Photo by Markus Winkler on Unsplash&quot;&gt;&lt;em&gt;Photo by &lt;a href=&quot;https://unsplash.com/@markuswinkler?utm_source=medium&amp;#x26;utm_medium=referral&quot;&gt;Markus Winkler&lt;/a&gt; on &lt;a href=&quot;https://unsplash.com?utm_source=medium&amp;#x26;utm_medium=referral&quot;&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This article is the first in a series where I’ll cover analysis of the Hespress Dataset.&lt;/p&gt;
&lt;p&gt;According to “alexa.com”, Hespress is ranked 4th in Morocco. It’s the biggest news site in the country, and the average Moroccan spends around 6 minutes daily on the website.&lt;/p&gt;
&lt;p&gt;The Hespress Dataset is a collection of 11K news articles labelled by topic and 300K comments, each with a score given by users; think of the scores as likes on a Facebook post. This dataset can be used for news article classification, which will be our focus in this article, and for sentiment analysis of Moroccan public opinion. You can download the dataset using the link below:
&lt;a href=&quot;https://www.kaggle.com/tariqmassaoudi/hespress&quot;&gt;&lt;strong&gt;Hespress&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This article is aimed at readers who have a little knowledge of machine learning, for example the difference between classification and regression, or what cross-validation is. I’ll still give a brief explanation of each step of the project.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Problem Introduction:&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Fortunately, our dataset contains both the articles and their labels, so we are dealing with a supervised learning problem. This makes our life much easier: if that weren’t the case, we would have to label each article manually or go with an unsupervised approach.&lt;/p&gt;
&lt;p&gt;In brief, our goal is to predict the topic of an article given its text. In total we have 11 topics:&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Tamazight (A Moroccan Language)&lt;/li&gt;
&lt;li&gt;Sport (Sport)&lt;/li&gt;
&lt;li&gt;Societe (Society)&lt;/li&gt;
&lt;li&gt;Regions (Regions)&lt;/li&gt;
&lt;li&gt;Politique (Politics)&lt;/li&gt;
&lt;li&gt;Orbites (World news)&lt;/li&gt;
&lt;li&gt;Medias (News from local newspapers)&lt;/li&gt;
&lt;li&gt;Marocains Du Monde (Moroccans of the world)&lt;/li&gt;
&lt;li&gt;Faits Divers (Miscellaneous)&lt;/li&gt;
&lt;li&gt;Economie (Economy)&lt;/li&gt;
&lt;li&gt;Art Et Culture (Art and culture)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;strong&gt;Exploratory Data Analysis:&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;We’ll be using seaborn for data visualisation and pandas for data manipulation.&lt;/p&gt;
&lt;p&gt;Let’s start by loading the data:&lt;/p&gt;
&lt;p&gt;The data is stored in several files, one per topic, so we’ll have to loop over the topics and concatenate the results.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; pandas &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; pd
stories&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;pd&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;DataFrame&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
topics&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;tamazight&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;sport&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;societe&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;regions&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;politique&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;orbites&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;medias&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;marocains-du-monde&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;faits-divers&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;economie&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;art-et-culture&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; topic &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; topics&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  stories&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;pd&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;concat&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;stories&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;pd&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;read_csv&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;stories_&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;+&lt;/span&gt;topic&lt;span class=&quot;token operator&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;.csv&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

stories&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;drop&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;columns&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Unnamed: 0&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;axis&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;inplace&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Next let’s get a sample from the data:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;stories&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;sample&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/2252/1*_uM4fxYgLH_PrQ_zWbZHcg.png&quot; alt=&quot;Sample columns from the stories dataset&quot;&gt;&lt;em&gt;Sample columns from the stories dataset&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We can see that we have 5 columns. For this article we are only interested in the story and topic columns.&lt;/p&gt;
&lt;p&gt;Now let’s check how many stories we have in each topic. This is extremely important for classification: if we have an &lt;strong&gt;imbalanced dataset&lt;/strong&gt; (i.e. far more datapoints in one topic than in the others), our model will be biased toward the larger topics and won’t work as well. One common solution to this problem is to apply an &lt;strong&gt;undersampling&lt;/strong&gt; or &lt;strong&gt;oversampling&lt;/strong&gt; method. We won’t go over the details since they are outside the scope of this article.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; seaborn &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; sns
storiesByTopic&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;stories&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;groupby&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;by&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;topic&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;count&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;story&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
sns&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;barplot&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;storiesByTopic&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;index&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;y&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;storiesByTopic&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/2282/1*Tnrx36tYvfFLtTtInoyKCQ.png&quot; alt=&quot;Count of stories by topic&quot;&gt;&lt;em&gt;Count of stories by topic&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We can see that we have almost 1000 stories per topic, so our dataset is perfectly balanced.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/2000/0*qA46C2LjmtYma_wb.jpg&quot; alt=&quot;Source: memegenerator.net&quot;&gt;&lt;em&gt;Source: memegenerator.net&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Data Cleaning:&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;We are dealing with Arabic text data. Our data cleaning process will consist of 2 steps:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Removing Stop Words&lt;/strong&gt;: some words such as “و” (and) and “كيف” (how) occur extremely often in all Arabic texts and carry no meaning that our model can use to predict a topic. Removing them reduces noise and lets our model focus on relevant words. To do so, we load a list of stop words and drop every word of each article that appears in the list.&lt;/p&gt;
&lt;p&gt;The stop words list that I used is available on &lt;a href=&quot;https://github.com/mohataher/arabic-stop-words/blob/master/list.txt&quot;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; nltk&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;tokenize &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; word_tokenize

file1 &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;stopwordsarabic.txt&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;r&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; encoding&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;utf-8&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; 
stopwords_arabic &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; file1&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;read&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;splitlines&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;المغرب&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;المغربية&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;المغربي&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;removeStopWords&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;text&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;stopwords&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    text_tokens &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; word_tokenize&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;text&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;join&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;word &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; word &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; text_tokens &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; word &lt;span class=&quot;token keyword&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; stopwords&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Removing Punctuation&lt;/strong&gt;: For the same reason we’ll remove punctuation. For this I’ve used a regular expression tokenizer that keeps only runs of word characters.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; nltk&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;tokenize &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; RegexpTokenizer
&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;removePunctuation&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;text&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    tokenizer &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; RegexpTokenizer&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;r&apos;\w+&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;join&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;tokenizer&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;tokenize&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;text&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
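&lt;p&gt;To see what the two cleaning steps do, here is a minimal sketch that uses only the standard library. It is an illustration under assumptions, not the code used for the results in this article: a plain whitespace split stands in for NLTK’s word_tokenize, and the sentence, the three stop words and the function names are made up for the example.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;import re

def remove_stop_words(text, stopwords):
    # Whitespace split stands in for nltk's word_tokenize (an assumption of this sketch)
    return " ".join(word for word in text.split() if word not in stopwords)

def remove_punctuation(text):
    # Keep only runs of word characters, like RegexpTokenizer(r'\w+') does
    return " ".join(re.findall(r"\w+", text))

# Toy sentence and toy stop word set, made up for illustration
sample = "ذهب الولد إلى المدرسة، و لعب في الملعب!"
toy_stopwords = {"إلى", "و", "في"}

# Same order as in the article: stop words first, then punctuation
cleaned = remove_punctuation(remove_stop_words(sample, toy_stopwords))
print(cleaned)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The printed sentence keeps the five content words and drops both the stop words and the punctuation marks, including the Arabic comma.&lt;/p&gt;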
&lt;h2&gt;&lt;strong&gt;Drawing a WordCloud:&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Let’s have some fun: we’re going to draw a word cloud of all the stories in our dataset using the Python “&lt;strong&gt;WordCloud&lt;/strong&gt;” library.&lt;/p&gt;
&lt;p&gt;Before doing so, there are some extra steps needed that are specific to Arabic. To learn more about them, visit this &lt;a href=&quot;https://amueller.github.io/word_cloud/auto_examples/arabic.html&quot;&gt;link&lt;/a&gt;.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; arabic_reshaper
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; bidi&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;algorithm &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; get_display
&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; matplotlib&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;pyplot &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; plt
&lt;span class=&quot;token operator&quot;&gt;%&lt;/span&gt;matplotlib inline

&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;preprocessText&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;text&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;stopwords&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;wordcloud&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    noStop&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;removeStopWords&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;text&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;stopwords&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    noPunctuation&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;removePunctuation&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;noStop&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; wordcloud&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        text&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;arabic_reshaper&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;reshape&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;noPunctuation&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        text&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;get_display&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;text&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; text
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; noPunctuation

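&lt;span class=&quot;token comment&quot;&gt;# Hypothetical sketch: drawWordcloud is not defined in the original post.&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# Assumes the wordcloud package and an Arabic-capable font file at font_path.&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; wordcloud &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; WordCloud

&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;drawWordcloud&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;texts&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;stopwords&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;font_path&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;NotoNaskhArabic-Regular.ttf&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    text&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;preprocessText&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;join&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;texts&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;stopwords&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;wordcloud&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    cloud&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;WordCloud&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;font_path&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;font_path&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;background_color&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;white&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;generate&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;text&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;imshow&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;cloud&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;interpolation&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;bilinear&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;axis&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;off&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

drawWordcloud&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;stories&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;story&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;stopwords_arabic&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;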
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/2280/1*Uc9cBE2aEXJRMDAuLrFIKA.png&quot; alt=&quot;Word Cloud of Hespress News Articles&quot;&gt;&lt;em&gt;Word Cloud of Hespress News Articles&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Since this dataset contains recent news articles, we see “كورونا” (coronavirus) as a recurring word. There’s also “الامازيغية” (Amazigh), which is a major language in Morocco; “محمد” (Mohammed), which is the most popular name in Morocco and also the name of the King of Morocco; and “الحكومة”, which means the government.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Feature engineering:&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Machine learning models are in essence mathematical equations and can’t understand raw text, so before running our models we need to transform our text into numbers. There are multiple approaches to do this; let’s look at the 2 most popular ones.&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;&lt;strong&gt;Word Count:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This one is very simple: every column represents a word from the entire stories corpus, every row represents a story, and each cell holds the number of times the word appears in the story!&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;TF–IDF:&lt;/strong&gt;&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;TF-IDF stands for “Term Frequency Inverse Document Frequency”. It uses a slightly more complicated approach that down-weights common words that occur in many documents.&lt;/em&gt;&lt;/p&gt;
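&lt;p&gt;The difference between the two approaches can be sketched on a toy corpus. This is a minimal illustration using the classic formula (term frequency times the log of the inverse document frequency); the three “stories” and the helper name tf_idf are made up for the example. scikit-learn’s TfidfVectorizer uses a smoothed variant of the formula and normalizes each row, so its numbers will differ.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;import math
from collections import Counter

# Three toy "stories", made up for illustration
docs = [
    "match goal team win",
    "team election vote",
    "film actor team",
]
tokenized = [doc.split() for doc in docs]

# Word count: how often each word appears in each story
counts = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many stories each word appears
vocab = sorted(set(word for tokens in tokenized for word in tokens))
doc_freq = {word: sum(1 for tokens in tokenized if word in tokens) for word in vocab}

def tf_idf(word, doc_index):
    # Classic formulation: term frequency times log(N / document frequency)
    tf = counts[doc_index][word] / len(tokenized[doc_index])
    idf = math.log(len(docs) / doc_freq[word])
    return tf * idf

print(tf_idf("team", 0))  # "team" is in every story, so its weight is zero
print(tf_idf("goal", 0))  # "goal" only appears in the first story&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Both words appear once in the first story, so a plain word count treats them the same, while TF-IDF gives the word shared by all stories no weight at all.&lt;/p&gt;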
&lt;p&gt;We will be using TF-IDF since in most cases it yields better performance!&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; sklearn&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;feature_extraction&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;text &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; TfidfVectorizer

&lt;span class=&quot;token comment&quot;&gt;#Clean the stories &lt;/span&gt;
stories&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;storyClean&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;stories&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;story&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;apply&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;lambda&lt;/span&gt; s&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; preprocessText&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;s&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;stopwords_arabic&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;#Vectorize the stories&lt;/span&gt;
vectorizer &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; TfidfVectorizer&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
X &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; vectorizer&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;fit_transform&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;stories&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;storyClean&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
y&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;stories&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;topic&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;Modelling:&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;We will try the following models:&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;Random Forest&lt;/li&gt;
&lt;li&gt;Logistic Regression&lt;/li&gt;
&lt;li&gt;SGDClassifier&lt;/li&gt;
&lt;li&gt;Multinomial Naïve Bayes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We will run the data through each model and use &lt;strong&gt;accuracy&lt;/strong&gt;, the ratio of correct predictions to total datapoints, as our metric. For a more reliable estimate we score each model using cross validation with 5 folds, then plot the results.&lt;/p&gt;
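&lt;p&gt;As a rough illustration of the metric and of 5-fold scoring, here is a small standard-library sketch. It rests on simplifying assumptions: a majority-class baseline stands in for a real model, every fifth datapoint goes into the same fold, and the labels and function names are made up. scikit-learn’s cross_val_score chooses its own splits, so treat this only as a picture of the idea.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;from collections import Counter

def accuracy(y_true, y_pred):
    # Ratio of correct predictions to total datapoints
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def cross_val_accuracy(labels, k):
    # Each fold is held out once; a majority-class baseline stands in for a real model
    indices = list(range(len(labels)))
    scores = []
    for i in range(k):
        test_idx = indices[i::k]
        held_out = set(test_idx)
        train_labels = [labels[j] for j in indices if j not in held_out]
        majority = Counter(train_labels).most_common(1)[0][0]
        test_labels = [labels[j] for j in test_idx]
        scores.append(accuracy(test_labels, [majority] * len(test_labels)))
    # The final score is the mean over the k folds
    return sum(scores) / k

# Toy labels, made up for illustration
toy_labels = ["sport"] * 7 + ["economie"] * 3
print(accuracy(["sport", "economie", "sport"], ["sport", "sport", "sport"]))
print(cross_val_accuracy(toy_labels, k=5))&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;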
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; sklearn&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;model_selection &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; train_test_split
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; sklearn&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;metrics &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; accuracy_score
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; sklearn&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;model_selection &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; cross_val_score
&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; numpy &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; np
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; sklearn&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;metrics &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; classification_report

&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;testModel&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;model&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;X&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;y&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    X_train&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; X_test&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; y_train&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; y_test &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; train_test_split&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;X&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; y&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; test_size&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0.2&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; random_state&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;42&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    model&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;fit&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;X_train&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;y_train&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    modelName &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;model&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;__name__
    pred&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;model&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;predict&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;X_test&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;modelName&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;classification_report&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;y_test&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;pred&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    score&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;np&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;mean&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;cross_val_score&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;model&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; X&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; y&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; cv&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; model&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;model&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;modelName&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;score&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;score&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/2272/1*0roKplrxJ6UXDwdAmaTxqg.png&quot; alt=&quot;Models accuracy&quot;&gt;&lt;em&gt;Models accuracy&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Our best model is &lt;strong&gt;SGDClassifier&lt;/strong&gt; with an accuracy of 87%.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Model Interpretation:&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Now that we have a working model, let’s try to understand a bit more about what’s happening. To do that, we will answer two questions:&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;What topics does our model struggle with?&lt;/li&gt;
&lt;li&gt;What words are most influential in predicting different topics?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For the first question we can check the classification report of our best model:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/2000/1*1tZdFEDNUQrnkh171mZ8rw.png&quot; alt=&quot;Classification Report SGDClassifier&quot;&gt;&lt;em&gt;Classification Report SGDClassifier&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We are predicting “Sport”, “Art”, “Medias” and “Tamazight” very accurately. We are struggling the most with “&lt;strong&gt;orbites&lt;/strong&gt;” (world news) and “&lt;strong&gt;societe&lt;/strong&gt;” (society). This might be because these two are more general and broad topics.&lt;/p&gt;
&lt;p&gt;To answer the second question, we will use a useful property of logistic regression: its weights can serve as a measure of the importance of each word for each topic. “&lt;strong&gt;ELI5&lt;/strong&gt;”, a Python library, makes it easy to do that:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/2240/1*N3olha4sCIs5S13IaFlwcQ.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/2000/1*GIJSoNquPJn7kyHQBHfy3Q.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;We can see that most of the words make sense and correspond to the theme of the topic. For example, for “Art” the top words are “Artist”, “Film”, “Culture” and “Book”.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;In this article, we’ve gone through all the steps required to design a text classification system for Arabic from Data Exploration to Model Interpretation. However, we can still improve our accuracy by tuning the hyperparameters.&lt;/p&gt;
&lt;p&gt;In the next article, we’ll try to make sense of the comments on each article using &lt;strong&gt;Sentiment Analysis.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If you managed to get here, congratulations! Thanks for reading, I hope you’ve enjoyed the article. For personal contact or discussion, feel free to reach out to me on &lt;a href=&quot;https://www.linkedin.com/in/tariqmassaoudi/&quot;&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[What You Should Know About Ensemble Learning]]></title><description><![CDATA[The wisdom of the crowds for machines Photo by Markus Spiske on UnsplashPhoto by Markus Spiske on Unsplash Introduction: You want to…]]></description><link>https://www.tariqmassaoudi.com/what-you-should-know-about-ensemble-learning-e92d4b3c3608/</link><guid isPermaLink="false">https://www.tariqmassaoudi.com/what-you-should-know-about-ensemble-learning-e92d4b3c3608/</guid><pubDate>Sun, 27 Sep 2020 22:40:32 GMT</pubDate><content:encoded>&lt;p&gt;The wisdom of the crowds for machines&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/8736/0*HroBikMnYnFLNQvU&quot; alt=&quot;Photo by Markus Spiske on Unsplash&quot;&gt;&lt;em&gt;Photo by &lt;a href=&quot;https://unsplash.com/@markusspiske?utm_source=medium&amp;#x26;utm_medium=referral&quot;&gt;Markus Spiske&lt;/a&gt; on &lt;a href=&quot;https://unsplash.com?utm_source=medium&amp;#x26;utm_medium=referral&quot;&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Introduction:&lt;/h2&gt;
&lt;p&gt;You want to organize a movie night with your friends and you’re looking for the perfect movie. You search on Netflix and stumble upon one that catches your attention. To decide whether the movie is worth watching, you have multiple options.&lt;/p&gt;
&lt;p&gt;Option A: Go ask your brother who has already watched the movie.&lt;/p&gt;
&lt;p&gt;Option B: Go to IMDb, check the rating and read multiple (hopefully spoiler-free) reviews.&lt;/p&gt;
&lt;p&gt;You’ll obviously go with option B, since the risk of getting a biased opinion is lower if you get multiple points of view as opposed to a single opinion from your brother. This is the idea and motivation behind ensemble methods. It’s the wisdom of crowds! Now let’s dive into a more technical definition of ensemble learning.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/11060/0*rfHWRmxY_Yz4OuMI&quot; alt=&quot;Photo by Arian Darvishi on Unsplash&quot;&gt;&lt;em&gt;Photo by &lt;a href=&quot;https://unsplash.com/@arianismmm?utm_source=medium&amp;#x26;utm_medium=referral&quot;&gt;Arian Darvishi&lt;/a&gt; on &lt;a href=&quot;https://unsplash.com?utm_source=medium&amp;#x26;utm_medium=referral&quot;&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;What is ensemble learning:&lt;/h2&gt;
&lt;p&gt;According to Scholarpedia:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Ensemble learning is the process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This means generating multiple models and combining their opinions in smart ways to get the best prediction possible. A well-built ensemble will often outperform any single one of its models, although this is not guaranteed. For this to work effectively, the individual models making up an ensemble should be different: there’s no point taking the collective opinion if all individual opinions are the same. We can differentiate our models by using different algorithms, changing the hyperparameters, or training them on different parts of our dataset.&lt;/p&gt;
&lt;h2&gt;How do we ensemble learn (techniques):&lt;/h2&gt;
&lt;h3&gt;Bagging:&lt;/h3&gt;
&lt;p&gt;Bagging stands for “&lt;em&gt;bootstrap aggregating&lt;/em&gt;” and it’s one of the simplest and most intuitive techniques to understand. In bagging we use the same algorithm while training on different subsets of the data. To get these subsets we use a technique called &lt;strong&gt;bootstrapping&lt;/strong&gt;, which samples the dataset with replacement:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/2000/1*MGZ9rfKx2dSRI-K7IYPCCg.png&quot; alt=&quot;Basic bootstrapping illustration, Image by Author&quot;&gt;&lt;em&gt;Basic bootstrapping illustration, Image by Author&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;As you can see, &lt;strong&gt;Apple&lt;/strong&gt; is repeated 2 times. In practice we often choose a smaller size for the bootstrapped datasets. After creating some bootstrapped datasets we train a model on each, then combine them to make an ensemble model. This is called &lt;strong&gt;aggregation&lt;/strong&gt;. For classification problems the class with the most votes is the prediction, and for regression problems we average the outputs of our models.&lt;/p&gt;
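&lt;p&gt;The two halves of bagging can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than a full implementation: no models are trained, the fruit list (apart from Apple) and the votes are made up, and the function names are only for this example.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;import random
from collections import Counter

def bootstrap_sample(data, size, rng):
    # Draw with replacement, so the same item can appear more than once
    return [rng.choice(data) for _ in range(size)]

def majority_vote(predictions):
    # Classification: the class with the most votes wins
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    # Regression: average the outputs of the models
    return sum(predictions) / len(predictions)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
fruits = ["Apple", "Banana", "Cherry", "Orange"]

# Bootstrapping: three resampled datasets, one per model we would train
samples = [bootstrap_sample(fruits, len(fruits), rng) for _ in range(3)]
print(samples)

# Aggregation: combine the predictions of the (hypothetical) trained models
print(majority_vote(["sport", "sport", "politique"]))
print(average([2.0, 3.0, 4.0]))&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;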
&lt;h3&gt;Boosting:&lt;/h3&gt;
&lt;p&gt;While bagging can be done in parallel (just train all your models at the same time), boosting is an iterative process. As in bagging, we use the same algorithm, but we don’t bootstrap the data and train all the models at the same time. Boosting is sequential: we train models one by one, and the performance of the previous model affects how we select the training dataset for the next model. More precisely, each new model tries to correct the mistakes made by its predecessor.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/2000/1*392-uo0h6JbiCixcbHIHyQ.png&quot; alt=&quot;The basic workings of boosting, Image by Author&quot;&gt;&lt;em&gt;The basic workings of boosting, Image by Author&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Popular algorithms that implement boosting are &lt;strong&gt;AdaBoost&lt;/strong&gt; and &lt;strong&gt;Gradient Boosting.&lt;/strong&gt;&lt;/p&gt;
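&lt;p&gt;To make “correcting the mistakes of the predecessor” concrete, here is a minimal sketch of one reweighting step in the style of binary AdaBoost. It only shows how the datapoint weights change after one model has been evaluated. The five datapoints and the function name adaboost_reweight are made up, and the sketch assumes the weighted error is strictly between 0 and 1.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;import math

def adaboost_reweight(weights, is_correct):
    # Weighted error of the current model (assumed strictly between 0 and 1)
    error = sum(w for w, ok in zip(weights, is_correct) if not ok)
    # Importance of the current model: larger when its error is smaller
    alpha = 0.5 * math.log((1 - error) / error)
    # Raise the weight of mistakes, lower the weight of correct predictions
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, is_correct)]
    # Normalize so the weights sum to one again
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Five datapoints with equal weights; the current model gets the third one wrong
weights = [0.2] * 5
is_correct = [True, True, False, True, True]

alpha, new_weights = adaboost_reweight(weights, is_correct)
print(alpha)
print(new_weights)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;After the update the misclassified datapoint carries half of the total weight, so the next model is pushed to get it right.&lt;/p&gt;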
&lt;h3&gt;Stacking:&lt;/h3&gt;
&lt;p&gt;This one is simple: we use different algorithms and combine their predictions, typically by training a final model (a meta-learner) on the outputs of the base models.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/2000/1*012oLlKPCVpgqNI-nOQGMQ.png&quot; alt=&quot;Basic workings of Stacking, Image by Author&quot;&gt;&lt;em&gt;Basic workings of Stacking, Image by Author&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Why should you ensemble learn?&lt;/h2&gt;
&lt;p&gt;As both intuition and practice confirm, ensemble methods tend to yield more accurate results and, when used wisely, are more resilient to overfitting. This is why they are widely used in Kaggle competitions. One drawback is that they require a lot more time to train.&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;Ensemble learning is turning multiple weak models to one strong model “together we are stronger”. Multiple techniques have been developed to accomplish this such as bagging, boosting and stacking. An ensemble model is always more accurate than a single model and can generalise better.&lt;/p&gt;
&lt;p&gt;I hope you’ve got a basic idea behind ensemble models. Now it’s time to implement then into your projects!&lt;/p&gt;
&lt;p&gt;Thanks for reading! ❤&lt;/p&gt;
&lt;p&gt;Follow me for more informative data science content.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[ML Basics : Loan Prediction]]></title><description><![CDATA[The complete Data Science pipeline on a simple problem Photo by Dmitry Demidko on UnsplashPhoto by Dmitry Demidko on Unsplash The problem…]]></description><link>https://www.tariqmassaoudi.com/ml-basics-loan-prediction-d695ba7f31f6/</link><guid isPermaLink="false">https://www.tariqmassaoudi.com/ml-basics-loan-prediction-d695ba7f31f6/</guid><pubDate>Thu, 06 Jun 2019 22:40:32 GMT</pubDate><content:encoded>&lt;p&gt;The complete Data Science pipeline on a simple problem&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/12000/0*eC1EUwoo6rMOypir&quot; alt=&quot;Photo by Dmitry Demidko on Unsplash&quot;&gt;&lt;em&gt;Photo by &lt;a href=&quot;https://unsplash.com/@wildbook?utm_source=medium&amp;#x26;utm_medium=referral&quot;&gt;Dmitry Demidko&lt;/a&gt; on &lt;a href=&quot;https://unsplash.com?utm_source=medium&amp;#x26;utm_medium=referral&quot;&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;The problem:&lt;/h2&gt;
&lt;p&gt;Dream Housing Finance company deals in all home loans. They have a presence across all urban, semi-urban and rural areas. A customer first applies for a home loan, and after that the company validates the customer’s eligibility for the loan.&lt;/p&gt;
&lt;p&gt;The company wants to automate the loan eligibility process (in real time) based on the customer details provided while filling in the online application form. These details are Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate this process, they have set the problem of identifying the customer segments that are eligible for a loan, so that they can specifically target these customers.&lt;/p&gt;
&lt;p&gt;It’s a classification problem: given information about the application, we have to predict whether the loan will be approved or not.&lt;/p&gt;
&lt;p&gt;We’ll start with exploratory data analysis, then preprocessing, and finally we’ll test different models such as logistic regression and decision trees.&lt;/p&gt;
&lt;p&gt;The data consists of the following columns:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;Loan_ID : Unique Loan ID
Gender : Male/ Female
Married : Applicant married (Y/N)
Dependents : Number of dependents
Education : Applicant Education (Graduate/ Under Graduate)
Self_Employed : Self employed (Y/N)
ApplicantIncome : Applicant income
CoapplicantIncome : Coapplicant income
LoanAmount : Loan amount in thousands of dollars
Loan_Amount_Term : Term of loan in months
Credit_History : Credit history meets guidelines (yes or no)
Property_Area : Urban/ Semi Urban/ Rural
Loan_Status : Loan approved (Y/N), this is the target variable&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2&gt;Exploratory data analysis:&lt;/h2&gt;
&lt;p&gt;We’ll be using seaborn for visualisation and pandas for data manipulation. You can download the dataset from here: &lt;a href=&quot;https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/&quot;&gt;https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;We’ll import the necessary libraries and load the data:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import numpy as np

train=pd.read_csv(&quot;train.csv&quot;)
test=pd.read_csv(&quot;test.csv&quot;)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We can look at the first few rows using the head function:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;train.head()&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/2534/1*UZr5Mmbw9vErMiEkWGlsPA.png&quot; alt=&quot;Image by Author&quot;&gt;&lt;em&gt;Image by Author&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We can see that there’s some missing data. We can explore this further using the pandas describe function:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;train.describe()&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/2000/1*v90EnHnMdSz5LjunB0o3Sg.png&quot; alt=&quot;Image by Author&quot;&gt;&lt;em&gt;Image by Author&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Some variables have missing values that we’ll have to deal with, and there also seem to be some outliers in the Applicant Income, Coapplicant Income and Loan Amount. We also see that about 84% of applicants have a credit history, because the mean of the Credit_History field is 0.84 and it only takes the values 1 (has a credit history) or 0 (does not).&lt;/p&gt;
&lt;p&gt;It would be interesting to study the distribution of the numerical variables, mainly the Applicant Income and the Loan Amount. To do this we’ll use seaborn for visualization.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;sns.distplot(train.ApplicantIncome,kde=False)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/2000/1*I13ZNw8VWHEitAIVSq6_ug.png&quot; alt=&quot;Image by Author&quot;&gt;&lt;em&gt;Image by Author&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The distribution is skewed and we can notice quite a few outliers.&lt;/p&gt;
&lt;p&gt;Since Loan Amount has missing values, we can’t plot it directly. One solution is to drop the rows with missing values and then plot it; we can do this using the dropna function:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;sns.distplot(train.LoanAmount.dropna(),kde=False)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/2000/1*NrCGWURr4W9Xjw7PBxcbEA.png&quot; alt=&quot;Image by Author&quot;&gt;&lt;em&gt;Image by Author&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;People with a better education should normally have a higher income; we can check that by plotting the education level against the income.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;sns.boxplot(x=&apos;Education&apos;,y=&apos;ApplicantIncome&apos;,data=train)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/2000/1*SzZv-uwE6H3FNv8gqB4IIQ.png&quot; alt=&quot;Image by Author&quot;&gt;&lt;em&gt;Image by Author&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The distributions are quite similar, but we can see that the graduates have more outliers, which means that the people with a huge income are most likely well educated.&lt;/p&gt;
&lt;p&gt;Another interesting variable is credit history. To check how it affects the Loan Status, we can turn the Loan Status into a binary variable and then calculate its mean for each value of credit history. A value close to 1 indicates a high loan success rate.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;#turn loan status into binary (on a copy, so train keeps its original labels)
modified=train.copy()
modified[&apos;Loan_Status&apos;]=modified[&apos;Loan_Status&apos;].apply(lambda x: 0 if x==&quot;N&quot; else 1 )
#calculate the mean of Loan_Status for each value of Credit_History
modified.groupby(&apos;Credit_History&apos;)[&apos;Loan_Status&apos;].mean()

OUT : 
Credit_History
0.0    0.078652
1.0    0.795789
Name: Loan_Status, dtype: float64&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;People with a credit history are far more likely to get their loan approved: about 0.80 vs 0.08. This means that credit history will be an influential variable in our model.&lt;/p&gt;
&lt;h2&gt;Data preprocessing:&lt;/h2&gt;
&lt;p&gt;The first thing to do is to deal with the missing values. Let’s first check how many there are for each variable.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;train.isnull().sum()
OUT:
Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;For numerical variables a simple solution is to fill missing values with the mean; for categorical ones we can fill them with the mode (the value with the highest frequency).&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;#categorical: fill with the most frequent value
for col in [&apos;Gender&apos;,&apos;Married&apos;,&apos;Dependents&apos;,&apos;Loan_Amount_Term&apos;,&apos;Credit_History&apos;,&apos;Self_Employed&apos;]:
    train[col]=train[col].fillna(train[col].mode()[0])
#numerical: fill with the mean
train[&apos;LoanAmount&apos;]=train[&apos;LoanAmount&apos;].fillna(train[&apos;LoanAmount&apos;].mean())&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
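&lt;p&gt;The two filling strategies can be sketched without pandas to show what happens under the hood. The helper name and the toy values below are hypothetical, and missing entries are represented by None.&lt;/p&gt;

```python
from statistics import mean, mode

def fill_missing(values, strategy):
    """Replace None entries with the mean or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [fill if v is None else v for v in values]

loan_amount = [120.0, None, 100.0, 140.0]   # numerical -> mean
gender = ["Male", "Female", None, "Male"]   # categorical -> mode

print(fill_missing(loan_amount, "mean"))
print(fill_missing(gender, "mode"))
```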
&lt;p&gt;Next we have to handle the outliers. One solution is just to remove them, but we can also log transform the variables to dampen their effect, which is the approach we went for here. Some people might have a low income but a strong CoapplicantIncome, so a good idea is to combine them in a TotalIncome column.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;train[&apos;LoanAmount_log&apos;]=np.log(train[&apos;LoanAmount&apos;])
train[&apos;TotalIncome&apos;]= train[&apos;ApplicantIncome&apos;] +train[&apos;CoapplicantIncome&apos;]
train[&apos;TotalIncome_log&apos;]=np.log(train[&apos;TotalIncome&apos;])&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Plotting the histogram of the log of the loan amount, we can see that it’s now much closer to a normal distribution!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/2000/1*F-nKRRSkTiGgmFznRoRKfg.png&quot; alt=&quot;Image by Author&quot;&gt;&lt;em&gt;Image by Author&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Modeling:&lt;/h2&gt;
&lt;p&gt;We’re going to use sklearn for our models. Before doing that, we need to turn all the categorical variables into numbers. We’ll do that using the LabelEncoder in sklearn.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;from sklearn.preprocessing import LabelEncoder
category= [&apos;Gender&apos;,&apos;Married&apos;,&apos;Dependents&apos;,&apos;Education&apos;,&apos;Self_Employed&apos;,&apos;Property_Area&apos;,&apos;Loan_Status&apos;]
encoder= LabelEncoder()
#encode each categorical column as integers
for i in category:
    train[i] = encoder.fit_transform(train[i])
train.dtypes

OUT:
Loan_ID               object
Gender                 int64
Married                int64
Dependents             int64
Education              int64
Self_Employed          int64
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area          int64
Loan_Status            int64
LoanAmount_log       float64
TotalIncome          float64
TotalIncome_log      float64
dtype: object&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now all our variables have become numbers that our models can understand.&lt;/p&gt;
&lt;p&gt;To try out different models we’ll create a function that takes in a model, fits it and measures its accuracy, which here means using the model on the train set and measuring the score on that same set. We’ll also use a technique called K-fold cross validation, which splits the data into K parts (folds), trains the model on K-1 of them and validates it on the remaining one. It repeats this K times, so that each fold is used once for validation, hence the name K-fold, and takes the average score. The latter method gives a better idea of how the model performs in real life.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;#Import the models
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold   #For K-fold cross validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

def classification_model(model, data, predictors, outcome):
  #Fit the model:
  model.fit(data[predictors],data[outcome])
  
  #Make predictions on training set:
  predictions = model.predict(data[predictors])
  
  #Print accuracy
  accuracy = metrics.accuracy_score(data[outcome],predictions)
  print (&quot;Accuracy : %s&quot; % &quot;{0:.3%}&quot;.format(accuracy))

  #Perform k-fold cross-validation with 5 folds
  kf = KFold(n_splits=5)
  scores = []
  for train_idx, test_idx in kf.split(data):
    # Filter training data
    train_predictors = (data[predictors].iloc[train_idx,:])
    
    # The target we&apos;re using to train the algorithm.
    train_target = data[outcome].iloc[train_idx]
    
    # Training the algorithm using the predictors and target.
    model.fit(train_predictors, train_target)
    
    #Record the accuracy score from each cross-validation run
    scores.append(model.score(data[predictors].iloc[test_idx,:], data[outcome].iloc[test_idx]))
 
  print (&quot;Cross-Validation Score : %s&quot; % &quot;{0:.3%}&quot;.format(np.mean(scores)))&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
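&lt;p&gt;To see what the K-fold split actually produces, here is a small stand-alone sketch that generates the train and test indices by hand, without shuffling. The function name is made up for illustration; in practice sklearn does this for us.&lt;/p&gt;

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if extra > fold else 0)
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i not in test_idx]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, n_folds=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

&lt;p&gt;Every row ends up in the test part of exactly one fold, which is why the averaged score is a fairer estimate than the accuracy measured on the training data.&lt;/p&gt;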
&lt;p&gt;Now we can test different models. We’ll start with logistic regression:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;outcome_var = &apos;Loan_Status&apos;
model = LogisticRegression()
predictor_var = [&apos;Credit_History&apos;,&apos;Education&apos;,&apos;Married&apos;,&apos;Self_Employed&apos;,&apos;Property_Area&apos;]
classification_model(model, train,predictor_var,outcome_var)
OUT : 
Accuracy : 80.945%
Cross-Validation Score : 80.946%&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now we’ll try a decision tree, which we might expect to give us a more accurate result:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;model = DecisionTreeClassifier()
predictor_var = [&apos;Credit_History&apos;,&apos;Gender&apos;,&apos;Married&apos;,&apos;Education&apos;]
classification_model(model, train,predictor_var,outcome_var)

OUT:
Accuracy : 80.945%
Cross-Validation Score : 78.179%&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We’ve got the same accuracy but a worse cross validation score; a more complex model doesn’t always mean a better score.&lt;/p&gt;
&lt;p&gt;Finally, we’ll try random forests:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;model = RandomForestClassifier(n_estimators=100)
predictor_var = [&apos;Gender&apos;, &apos;Married&apos;, &apos;Dependents&apos;, &apos;Education&apos;,
       &apos;Self_Employed&apos;, &apos;Loan_Amount_Term&apos;, &apos;Credit_History&apos;, &apos;Property_Area&apos;,
        &apos;LoanAmount_log&apos;,&apos;TotalIncome_log&apos;]
classification_model(model, train,predictor_var,outcome_var)

OUT: 
Accuracy : 100.000%
Cross-Validation Score : 78.015%&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The model gives us a perfect score on accuracy but a low score in cross validation; this is a good example of overfitting. The model has a hard time generalizing since it fits the train set perfectly.&lt;/p&gt;
&lt;p&gt;Solutions to this include reducing the number of predictors or tuning the model parameters.&lt;/p&gt;
&lt;h2&gt;Conclusion:&lt;/h2&gt;
&lt;p&gt;We’ve gone through a good portion of the data science pipeline in this article, namely EDA, preprocessing and modeling, and we’ve used essential classification models such as logistic regression, decision trees and random forests. It would be interesting to learn more about the logic behind these algorithms, and also to tackle the data scraping and deployment phases. We’ll try to do that in the next articles.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[ML Basics: predicting house prices]]></title><description><![CDATA[What’s machine learning: In simple terms , it’s the process of teaching machines to solve particular problems without being explicitly…]]></description><link>https://www.tariqmassaoudi.com/ml-basics-predicting-house-prices-9efc34182dd7/</link><guid isPermaLink="false">https://www.tariqmassaoudi.com/ml-basics-predicting-house-prices-9efc34182dd7/</guid><pubDate>Sun, 12 May 2019 22:40:32 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/500/0*KBW4KfmvbEz3WHEv&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h1&gt;&lt;strong&gt;What’s machine learning:&lt;/strong&gt;&lt;/h1&gt;
&lt;p&gt;In simple terms, it’s the process of teaching machines to solve particular problems without being explicitly programmed.&lt;/p&gt;
&lt;p&gt;Sounds fascinating, but how does one teach a machine? The answer is by using math: some smart people have figured out ways to simulate how humans learn, which is by observation. The core of the machine learning process comes down to feeding a machine learning model a bunch of observations with their corresponding labels, which we call “training”, then testing the model on observations that it didn’t see in the training phase, which we call “validation”. A better model has more accurate validation results.&lt;/p&gt;
&lt;p&gt;Example: teach a machine how to tell if a picture shows a cat or a dog&lt;/p&gt;
&lt;p&gt;Step 1: get a huge number of pictures of cats and dogs and classify them yourself&lt;/p&gt;
&lt;p&gt;Step 2: feed an ML model the pictures and watch it learn&lt;/p&gt;
&lt;p&gt;Step 3: get new pictures of cats and dogs and test if your model performs well&lt;/p&gt;
&lt;h1&gt;&lt;strong&gt;The competition:&lt;/strong&gt;&lt;/h1&gt;
&lt;p&gt;“House Prices” is a Kaggle competition in the knowledge section, meant for beginners to practise their data science skills. The objective is to predict a house’s price given a bunch of information about it, for example its area, whether it has a pool…&lt;/p&gt;
&lt;p&gt;It’s pretty complicated to tackle this kind of challenge without the proper background, so in this article we’ll go through the typical machine learning process while simplifying any ambiguous statistical terms, so only basic math skills will be required.&lt;/p&gt;
&lt;p&gt;We’ll start with exploratory data analysis, which aims to get a feel for the data by observing and analyzing it using graphs. This will help us identify important features, spot irregularities…&lt;/p&gt;
&lt;p&gt;Then we’ll do a little bit of data cleaning and preprocessing, so we’ll fix any problems with the data and prepare it to be consumed by our model.&lt;/p&gt;
&lt;p&gt;Finally, we’ll take our clean data and feed it to a model of our choice. In this tutorial we’ll be using a simple linear regression model, then we’ll explore the different ways to evaluate our model’s performance and we’ll also try to improve it.&lt;/p&gt;
&lt;h1&gt;&lt;strong&gt;Exploratory data analysis:&lt;/strong&gt;&lt;/h1&gt;
&lt;p&gt;The first step is to download the dataset from the competition’s website:&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data&quot;&gt;House Prices: Advanced Regression Techniques&lt;/a&gt; (www.kaggle.com)&lt;/p&gt;
&lt;p&gt;We’ll get “train.csv”, “test.csv” and “data_description.txt” which explains what each column means.&lt;/p&gt;
&lt;p&gt;Then we import the required libraries: seaborn and matplotlib for visualisation, pandas and numpy for data wrangling.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;import pandas as pd  
import matplotlib.pyplot as plt  
import seaborn as sns  
import numpy as np  
%matplotlib inline&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We can use Pandas to read in csv files. The  &lt;code class=&quot;language-text&quot;&gt;pd.read_csv()&lt;/code&gt;  method creates a DataFrame from a csv file.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;train = pd.read_csv(&apos;train.csv&apos;)  
test = pd.read_csv(&apos;test.csv&apos;)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Let’s check the size of the data:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;print (&quot;Train size:&quot;, train.shape)
print (&quot;Test size:&quot;, test.shape)

out :
Train size: (1460, 81)
Test size: (1459, 80)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We can see that the test data has one column fewer: the price of the house, which makes sense because that’s what we need to predict in the competition.&lt;/p&gt;
&lt;p&gt;Now we’ll look at a few rows of the data using the  &lt;code class=&quot;language-text&quot;&gt;DataFrame.head()&lt;/code&gt;  method.&lt;/p&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;train.head()&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*dvLK0haKPejUtAL8nvVaXw.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;We can notice that some of the columns such as PoolQC have missing values. We’ll deal with that later.&lt;/p&gt;
&lt;p&gt;To make some sense of the column names we can check the data description file. Here’s a brief version of what you’ll find there.&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;SalePrice&lt;/code&gt;  — the property’s sale price in dollars. This is the target variable that we’re trying to predict.&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;MSSubClass&lt;/code&gt;  — The building class&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;MSZoning&lt;/code&gt;  — The general zoning classification&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;LotFrontage&lt;/code&gt;  — Linear feet of street connected to property&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;LotArea&lt;/code&gt;  — Lot size in square feet&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;Street&lt;/code&gt;  — Type of road access&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We’re trying to predict the SalePrice column using all the other available columns. To get more information about our target variable we can use the describe command:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;train[&apos;SalePrice&apos;].describe()

out :
count      1460.000000  
mean     180921.195890  
std       79442.502883  
min       34900.000000  
25%      129975.000000  
50%      163000.000000  
75%      214000.000000  
max      755000.000000  
Name: SalePrice, dtype: float64&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The count gives the number of price observations available and the mean is the average sale price. We also get the standard deviation, which is a measure of the dispersion in prices, the min and max, and the percentiles: the 25%, 50% and 75% rows are the prices below which 25%, 50% and 75% of the houses fall.&lt;/p&gt;
&lt;p&gt;We’ll dive deeper into the SalePrice analysis by plotting a histogram and checking its skew value.&lt;/p&gt;
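&lt;p&gt;As a rough illustration of where these summary numbers come from, the sketch below recomputes them with Python’s standard library on a handful of made-up prices. It assumes linear interpolation for the percentiles and the sample standard deviation, which is how pandas computes them by default as far as I know.&lt;/p&gt;

```python
from statistics import mean, stdev, quantiles

prices = [100, 150, 200, 250, 300]  # made-up sale prices

summary = {
    "count": len(prices),
    "mean": mean(prices),
    "std": stdev(prices),          # sample standard deviation
    "min": min(prices),
    "max": max(prices),
}
# Quartiles with linear interpolation between data points.
q25, q50, q75 = quantiles(prices, n=4, method="inclusive")
summary.update({"25%": q25, "50%": q50, "75%": q75})
print(summary)
```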
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;plt.rcParams[&apos;figure.figsize&apos;] = [15, 10]
sns.distplot(train[&apos;SalePrice&apos;]);
print(&quot;Skewness: %f&quot; % train[&apos;SalePrice&apos;].skew())&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*jqpBCVwP77RBpPEcPxAOng.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;A histogram of the sale price&lt;/p&gt;
&lt;p&gt;Skewness is the degree of distortion from a normal distribution in a set of data. A distribution with 0 skewness is perfectly symmetrical. A positive skewness means the tail extends to the right (most values are concentrated on the left), and a negative skewness means the tail extends to the left.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/446/0*OmGScsk6ulZj4VBr.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Skewness is a problem because it can make our linear regression model inaccurate. We’ll be dealing with it in the preprocessing phase.&lt;/p&gt;
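&lt;p&gt;To get an intuition for the sign of the skewness, here is a simple moment-based version written by hand. It’s a sketch on toy numbers: pandas applies a small-sample bias correction, so its values differ slightly, but the sign tells the same story.&lt;/p&gt;

```python
def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 20]   # a few very expensive houses
left_tailed = [-v for v in right_tailed]

print(skewness(symmetric))     # zero: perfectly symmetrical
print(skewness(right_tailed))  # positive: long tail on the right
print(skewness(left_tailed))   # negative: long tail on the left
```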
&lt;p&gt;To get a feel of the data we’ll plot some variables and see their effect on price.&lt;/p&gt;
&lt;p&gt;We’ll start with the living area:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;sns.scatterplot(x=&apos;GrLivArea&apos;,y=&apos;SalePrice&apos;,data=train)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*u0XPgW86pnbr_TnfedR-mg.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;There’s a clear linear relationship, which is good for our model. We can also see some outliers (some houses with really large areas and a low price). Outliers can damage the quality of the model, so we’ll have to delete them.&lt;/p&gt;
&lt;p&gt;We’ll now check the SalePrice against the overall quality:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;sns.boxplot(x=&apos;OverallQual&apos;,y=&apos;SalePrice&apos;,data=train)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*VCU06W6U7RXTG_ctiatbjQ.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;As expected, when the quality increases so does the sale price.&lt;/p&gt;
&lt;p&gt;Finally, to identify the most important variables we’ll check the correlation matrix and rank the variables based on their correlation with the target variable.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;#correlation heatmap (numerical columns only)
sns.heatmap(train.corr(numeric_only=True))
#correlations sorting
#top correlated variables
train.corr(numeric_only=True)[&apos;SalePrice&apos;].sort_values(ascending=False)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*C2OfTu9ec6U83H4LTL62MA.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;Top correlated variables :

SalePrice        1.000000  
OverallQual      0.790982  
GrLivArea        0.708624  
GarageCars       0.640409  
GarageArea       0.623431  
TotalBsmtSF      0.613581  
1stFlrSF         0.605852  
FullBath         0.560664  
TotRmsAbvGrd     0.533723  
YearBuilt        0.522897  
YearRemodAdd     0.507101  
GarageYrBlt      0.486362  
MasVnrArea       0.477493  
Fireplaces       0.466929  
BsmtFinSF1       0.386420  
LotFrontage      0.351799  
WoodDeckSF       0.324413  
2ndFlrSF         0.319334  
OpenPorchSF      0.315856  
HalfBath         0.284108  
LotArea          0.263843  
BsmtFullBath     0.227122  
BsmtUnfSF        0.214479  
BedroomAbvGr     0.168213  
ScreenPorch      0.111447  
PoolArea         0.092404  
MoSold           0.046432  
3SsnPorch        0.044584&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
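&lt;p&gt;The numbers above are Pearson correlation coefficients, which is what pandas computes by default. Here’s a hand-written sketch of the same ranking idea on invented columns (the values are not taken from the dataset):&lt;/p&gt;

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

area = [50, 80, 100, 120, 200]
price = [100, 160, 200, 240, 400]   # exactly 2 * area
age = [60, 45, 30, 20, 5]           # in this toy data, older houses are cheaper

features = {"area": area, "age": age}
# Rank the features by their correlation with the target, highest first.
ranking = sorted(features, key=lambda name: pearson(features[name], price), reverse=True)
print(ranking)
```

&lt;p&gt;A coefficient close to 1 means the variable rises together with the price, a value close to -1 means it moves in the opposite direction, and a value near 0 means there’s little linear relationship.&lt;/p&gt;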
&lt;h1&gt;&lt;strong&gt;Data preprocessing :&lt;/strong&gt;&lt;/h1&gt;
&lt;h2&gt;Handling Null Values:&lt;/h2&gt;
&lt;p&gt;Next, we’ll examine the null or missing values. We’ll check how many there are in each variable, along with another important measure: the percentage of null values in each column.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;#missing data count and percentage  
total = train.isnull().sum().sort_values(ascending=False)  
percent = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)  
missing_data = pd.concat([total, percent], axis=1, keys=[&apos;Total&apos;, &apos;Percent&apos;])  
missing_data.head(20)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/288/1*uutbHCWNNmmLdAR36vPIdw.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;We see that for PoolQC, MiscFeature, Alley and Fence most of the datapoints are null. Although it’s not the best path, one way to deal with missing data is to fill it with the column’s mean, which we can do easily using the fillna() method. Note that a mean only exists for numerical columns, so this fills the numerical gaps only.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;train = train.fillna(train.mean(numeric_only=True))&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2&gt;Removing outliers:&lt;/h2&gt;
&lt;p&gt;When we visualized the living area against SalePrice in the EDA section, we found a few datapoints that clearly don’t follow the trend. In statistics we call them outliers, and they can make the model less accurate. In our case, to remove those datapoints we can target the houses whose living area exceeds 4000 square feet.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;#remove outliers
train = train[train.GrLivArea &amp;lt; 4000]
sns.scatterplot(x=train.GrLivArea, y=train.SalePrice)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*KyQ-Yb0pYrtDYVUXfn068Q.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Handling skewness:&lt;/h2&gt;
&lt;p&gt;We found positive skewness in the salePrice , to deal with that a common method is to use the log transform. To do that we can use the np.log1p() function. Then we plot again to check if that worked.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;from scipy.stats import norm

#log transform the target, then plot its distribution against a fitted normal curve
train.SalePrice = np.log1p(train.SalePrice)
sns.distplot(train[&apos;SalePrice&apos;], fit=norm)
fig = plt.figure()&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As you can see, our plot in blue is now very close to a normal distribution!&lt;/p&gt;
&lt;p&gt;There are more variables with skewness that we’d like to remove. A good way to do that is to measure their skewness and apply the log transform to the variables whose skewness exceeds a certain value.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;#log transform all the numerical skewed data
#get all numerical features
numeric_feats = train.dtypes[train.dtypes != &quot;object&quot;].index
#compute skewness
skewed_feats = train[numeric_feats].apply(lambda x: x.skew())
skewed_feats = skewed_feats[skewed_feats &gt; 0.75]
skewed_feats = skewed_feats.index
print(skewed_feats)
train[skewed_feats] = np.log1p(train[skewed_feats])

out :
Index([&apos;MSSubClass&apos;, &apos;LotFrontage&apos;, &apos;LotArea&apos;, &apos;MasVnrArea&apos;, &apos;BsmtFinSF1&apos;,
       &apos;BsmtFinSF2&apos;, &apos;BsmtUnfSF&apos;, &apos;TotalBsmtSF&apos;, &apos;1stFlrSF&apos;, &apos;2ndFlrSF&apos;,
       &apos;LowQualFinSF&apos;, &apos;GrLivArea&apos;, &apos;BsmtHalfBath&apos;, &apos;KitchenAbvGr&apos;,
       &apos;TotRmsAbvGrd&apos;, &apos;WoodDeckSF&apos;, &apos;OpenPorchSF&apos;, &apos;EnclosedPorch&apos;,
       &apos;3SsnPorch&apos;, &apos;ScreenPorch&apos;, &apos;PoolArea&apos;, &apos;MiscVal&apos;],
      dtype=&apos;object&apos;)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
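&lt;p&gt;To see why the log transform helps, here is a minimal sketch in plain Python. The sample values and the skewness helper are made up for illustration; the helper uses the simple population formula, whereas pandas applies a bias correction in skew(), so its numbers will differ slightly.&lt;/p&gt;

```python
import math

def skewness(values):
    """Population skewness: third central moment over the variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed sample: each value doubles the previous one.
areas = [1, 2, 4, 8, 16, 32, 64, 128]
logged = [math.log1p(v) for v in areas]

skew_before = skewness(areas)
skew_after = skewness(logged)
print(round(skew_before, 2), round(skew_after, 2))
```

&lt;p&gt;With this sample the skewness starts above the 0.75 threshold used in the article and falls below it after np.log1p-style scaling, which is the effect we are after on the real columns.&lt;/p&gt;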
&lt;h2&gt;Turning categorical columns into dummy variables:&lt;/h2&gt;
&lt;p&gt;Linear regression models can’t handle categorical data directly, so a common way to solve that problem is to turn categories into new binary columns. For example, a column for sex with “male” and “female” will turn into two binary columns, one for “male” and one for “female”, each taking 0 or 1 as values. We can do that easily in pandas using the built-in pd.get_dummies() function.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;train = pd.get_dummies(train)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
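&lt;p&gt;As a rough picture of what happens to one categorical column, here is a small sketch in plain Python. The one_hot helper and the two-row table are hypothetical; the new column names follow the column_category pattern that pd.get_dummies() uses by default.&lt;/p&gt;

```python
def one_hot(rows, column):
    """Replace `column` in every row with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every other column unchanged.
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded

# Hypothetical two-row table with one categorical column.
people = [{"age": 31, "sex": "male"}, {"age": 27, "sex": "female"}]
encoded = one_hot(people, "sex")
print(encoded)
```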
&lt;h1&gt;Modeling:&lt;/h1&gt;
&lt;p&gt;The final step is modeling. We’ll be building a simple linear model.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Create linear regression object
regr = linear_model.LinearRegression()

# Use the first 730 rows for training and the rest for testing
X_train = train[:730]
Y_train = y[:730]
X_test = train[730:]
Y_test = y[730:]

# Train the model using the training set
regr.fit(X_train, Y_train)

# Make predictions using the testing set
pred = regr.predict(X_test)
print(&quot;Mean squared error: %.9f&quot; % mean_squared_error(Y_test, pred))

out :
Mean squared error: 0.001315085&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We’ve found a mean squared error of about 0.001, but what does that mean?&lt;/p&gt;
&lt;p&gt;The mean squared error tells you how close a regression line is to a set of points. It does this by taking the distances from the points to the regression line (the errors) and squaring them. The squaring removes any negative signs and also gives more weight to larger differences. It’s called the mean squared error because you’re finding the average of a set of squared errors.&lt;/p&gt;
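&lt;p&gt;The description above can be checked by hand with a few lines of plain Python. The numbers and the mse_by_hand name are made up for illustration; sklearn’s mean_squared_error computes the same quantity.&lt;/p&gt;

```python
def mse_by_hand(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical values, chosen only to keep the arithmetic easy to follow.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

# The errors are 1, 0 and -2, so the squared errors are 1, 0 and 4.
mse = mse_by_hand(actual, predicted)
print(mse)
```

&lt;p&gt;Note how the single error of 2 contributes four times as much as the error of 1: that is the extra weight given to larger differences.&lt;/p&gt;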
&lt;h1&gt;Conclusion:&lt;/h1&gt;
&lt;p&gt;Throughout this article we’ve looked at how to approach a machine learning problem. We’ve gone through all the steps required to solve one, from exploratory data analysis and data preprocessing to modeling. To improve the accuracy we could have done some feature engineering (creating new features from the ones we have) or used more complex models. I hope this introduction was of great help!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Bubble sort for dummies]]></title><description><![CDATA[We’ll have fun exploring one of the most simple sorting algorithms! Bubble sort Do we really need sorting algorithms? Humans are indeed an…]]></description><link>https://www.tariqmassaoudi.com/bubble-sort-for-dummies-e3dbe3d9fea9/</link><guid isPermaLink="false">https://www.tariqmassaoudi.com/bubble-sort-for-dummies-e3dbe3d9fea9/</guid><pubDate>Sun, 29 Jul 2018 22:40:32 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/646/1*xBNaUDWTVnNIvMTrpS00yQ.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;We’ll have fun exploring one of the simplest sorting algorithms: bubble sort!&lt;/p&gt;
&lt;h1&gt;Do we really need sorting algorithms?&lt;/h1&gt;
&lt;p&gt;Humans are indeed an intelligent species. We crave organizing every aspect of our lives. In modern times digital life has become as influential as the real one, and the way to organize this online mess is through sorting algorithms. These pieces of coded logic are literally everywhere on the internet. You want to check the latest post on your favorite blog? Just press the button to sort them by new. You want to find the cheapest toothbrush on an e-commerce website? Just sort them by price!&lt;/p&gt;
&lt;p&gt;The most important aspect of a sorting algorithm is its speed: no one wants to wait decades to get their emails sorted! Fortunately today’s computers are really fast, but still only the fastest sorting algorithms are used in practice. In this post we will talk about one of the slowest. Its simplicity makes it a great introduction to sorting, but it is rarely used in practice.&lt;/p&gt;
&lt;h1&gt;Bubble Sort intuition:&lt;/h1&gt;
&lt;blockquote&gt;
&lt;p&gt;Tell me and I forget, teach me and I may remember, involve me and I learn.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;One of the best ways to learn an algorithm is to find it out yourself. So in this section we’ll try to invent bubble sort! Are you ready to make some bubbles?&lt;/p&gt;
&lt;p&gt;You have an initial unordered list of numbers. The objective is to sort them! You can perform two simple actions: comparing two elements of the list, and swapping them. Can you come up with a simple algorithm to sort the list using only those two actions?&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*0KdjgvLQe9GPiFaaNmtZhA.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Get a sheet of paper and think it through, it’s worth it!&lt;/p&gt;
&lt;h1&gt;How to Bubble Sort?&lt;/h1&gt;
&lt;p&gt;Hope you had fun inventing algorithms! If you’re lucky you have already come up with bubble sort!&lt;/p&gt;
&lt;p&gt;Bubble sort is comparison based: you basically compare each element with the next one. If the current element is  &lt;strong&gt;larger&lt;/strong&gt;  than the next element you  &lt;strong&gt;swap&lt;/strong&gt;  them; if not, you leave them in place and move on to the next element.&lt;/p&gt;
&lt;p&gt;When you reach the end of the array you go back to the first element and  &lt;strong&gt;repeat the process.&lt;/strong&gt; Stop when the array is sorted!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*UNPQJvW5wsVocu4NrO5cmA.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Bubble sort on the example array&lt;/p&gt;
&lt;p&gt;You could ask yourself: how many repetitions should I perform? It turns out that the maximum needed is (&lt;strong&gt;length of the array -1&lt;/strong&gt;). For our example array we had to do 2 repetitions; if the array had been completely disordered we would have had to do 3!&lt;/p&gt;
&lt;h1&gt;Bubble sort in code:&lt;/h1&gt;
&lt;p&gt;Finally here’s an implementation of bubble sort in code.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;bubbleSort&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;arr&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token comment&quot;&gt;#get the length of the array&lt;/span&gt;
    n &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;arr&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token comment&quot;&gt;# Traverse through all the elements of the array&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; i &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;n&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; j &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; n&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;token comment&quot;&gt;# if the current element is larger than the next one swap&lt;/span&gt;
            &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; arr&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;j&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; arr&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;j&lt;span class=&quot;token operator&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;token comment&quot;&gt;#this is the python shortcut for swapping&lt;/span&gt;
                arr&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;j&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; arr&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;j&lt;span class=&quot;token operator&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; arr&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;j&lt;span class=&quot;token operator&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; arr&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;j&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
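&lt;p&gt;The implementation above always runs every pass, even when the array is already sorted. As a hedged sketch of the “stop when the array is sorted” idea, the variant below (the name bubble_sort_early_exit is mine, not from a library) stops as soon as a pass makes no swap, and also skips the tail that is already in place. It returns the number of passes so we can compare with the counts discussed earlier.&lt;/p&gt;

```python
def bubble_sort_early_exit(arr):
    """Sort arr in place in ascending order and return the number of passes."""
    n = len(arr)
    passes = 0
    for i in range(n - 1):
        swapped = False
        passes += 1
        # After i passes the last i elements are already in their final place.
        for j in range(n - 1 - i):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
                swapped = True
        # A pass without any swap means the array is sorted.
        if not swapped:
            break
    return passes

numbers = [4, 3, 2, 1]  # hypothetical fully reversed array of length 4
passes = bubble_sort_early_exit(numbers)
print(numbers, passes)
```

&lt;p&gt;For a reversed array of length 4 this needs the maximum of 3 passes, while an already sorted array is confirmed in a single pass.&lt;/p&gt;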
&lt;h1&gt;How fast is bubble sort?&lt;/h1&gt;
&lt;p&gt;As expected, bubble sort is really slow compared to more optimized algorithms. In computer science we describe how fast an algorithm is using big O notation. Basically, it describes how the number of steps an algorithm takes grows with the size of the input, usually in the  &lt;strong&gt;worst case scenario.&lt;/strong&gt; Each pass of bubble sort walks through the whole array, which has a length of, let’s say,  &lt;strong&gt;n&lt;/strong&gt;, and in the worst case it repeats this  &lt;strong&gt;n-1&lt;/strong&gt;  times, so the total number of steps is about n × (n-1) =  &lt;strong&gt;n² - n.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For large values of n, n² is much bigger than n (you can check it with a calculator), so we can ignore the n term and say that bubble sort has a complexity of O(&lt;strong&gt;n²&lt;/strong&gt;).&lt;/p&gt;
&lt;p&gt;The most widely used algorithms include quicksort and mergesort. Mergesort sorts in O(n*log(n)), and quicksort does so on average (its worst case is O(n²)). On large inputs these will outperform bubble sort by a wide margin.&lt;/p&gt;
&lt;p&gt;To check this you can calculate n² and n*log(n) for a few values of n, using a base-10 logarithm here:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;if we choose n=10
n²=100      and     n*log(n)=10
now for n=1000
n²=1000000   and n*log(n)=3000&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
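&lt;p&gt;If you’d rather not reach for a calculator, the same comparison can be sketched in a few lines of Python. The step_counts helper is hypothetical, and it uses a base-10 logarithm only to match the numbers above; a different base changes the result by a constant factor, not the overall picture.&lt;/p&gt;

```python
import math

def step_counts(n):
    """Return (n squared, n times log10 of n), rounded to whole steps."""
    return n * n, round(n * math.log10(n))

for n in (10, 1000, 1000000):
    quadratic, linearithmic = step_counts(n)
    print(n, quadratic, linearithmic)
```

&lt;p&gt;The gap keeps widening as n grows, which is why the n² behaviour matters so much on large inputs.&lt;/p&gt;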
&lt;p&gt;In this post we learned how bubble sort works. It might be a snail in terms of speed, but understanding it is essential for tackling more complex algorithms!&lt;/p&gt;
&lt;p&gt;I hope this post helped you to sort your bubbles !&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Algorithmic corner : Linear regression]]></title><description><![CDATA[The basics: In this article we’ll try to uncover how linear regression works. The best way to understand it is through example. Suppose we…]]></description><link>https://www.tariqmassaoudi.com/algorithmic-corner-linear-regression copy/</link><guid isPermaLink="false">https://www.tariqmassaoudi.com/algorithmic-corner-linear-regression copy/</guid><pubDate>Fri, 08 Jun 2018 22:12:03 GMT</pubDate><content:encoded>&lt;h1&gt;The basics:&lt;/h1&gt;
&lt;p&gt;In this article we’ll try to uncover how linear regression works. The best way to understand it is through an example. Suppose we are trying to predict a student’s grade given how many classes they missed. With enough data points we’ll end up with a graph that looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*gcafDoOHrkd5Xr-qpWS_ew.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Doing a linear regression means finding the line that is closest to all the data points. In mathematics the equation of a line is y=ax+b, where “a” is the slope and “b” is the intercept. So to find this line we have to find the best “a” and “b” coefficients. But how do we do that, and what does the “best line” mean concretely?&lt;/p&gt;
&lt;h1&gt;Least squares regression:&lt;/h1&gt;
&lt;p&gt;The best line is the one that is closest to all the data points. In other terms, it’s the line that minimizes the sum of the squared vertical distances between each point and the fitted line. We can see that visually below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*pXTkGi4y1wPcPfo1i8xVtA.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;To calculate this error, we take the difference between an observed value and the value predicted by the line, square it, and sum this over all data points. Mathematically it looks like:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/369/1*XiLqraVva35_nHEKMhDrBQ.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Using calculus we can easily get the parameters “a” and “b” for the best line.&lt;/p&gt;
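&lt;p&gt;For the single-variable case, setting the derivatives of the squared error with respect to “a” and “b” to zero gives a closed-form answer. Here is a minimal sketch of that result in plain Python; the fit_line name and the absence/grade numbers are made up for illustration.&lt;/p&gt;

```python
def fit_line(xs, ys):
    """Return slope a and intercept b that minimize the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Setting both partial derivatives of the squared error to zero gives:
    #   a = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
    #   b = mean_y - a * mean_x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Hypothetical data: absences vs. grade, lying exactly on y = -2x + 18.
absences = [0, 1, 2, 3, 4]
grades = [18, 16, 14, 12, 10]
a, b = fit_line(absences, grades)
print(a, b)
```

&lt;p&gt;Because the made-up points sit exactly on a line, the fit recovers its slope and intercept; with noisy data the same formulas return the line with the smallest total squared error.&lt;/p&gt;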
&lt;p&gt;A good thing about linear regression is that it generalizes easily to problems of higher dimension. It’s a matter of adding more terms to the equation and calculating more coefficients. A general model looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*twPf-JqkR_vntaMRvoo5cQ.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h1&gt;Model evaluation: R squared:&lt;/h1&gt;
&lt;p&gt;How do we determine how well the model fits the data? One way is to calculate the R² score.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*pwZOTsK4Av51E7-h2KbJww.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;For a least-squares fit with an intercept, measured on the training data, R-squared is always between 0 and 100%:&lt;/p&gt;
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;0% indicates that the model explains none of the variability of the response data around its mean.&lt;/li&gt;
&lt;li&gt;100% indicates that the model explains all the variability of the response data around its mean.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In simple terms, R-squared measures how much better our model is than a baseline that always predicts the mean of the data. Generally, higher values of R-squared are more desirable. We can measure R-squared on the data we used for training, but that doesn’t tell us how well the model will perform in real life, so a good idea is to split the data into training and test sets and calculate R-squared for both. Generally we’ll observe that the model performs better on the training data. Another way to assess the model’s performance is through the  &lt;strong&gt;root mean squared error&lt;/strong&gt;, which tells you how concentrated the data is around the regression line. The lower this error, the better the model. We can calculate it with the formula:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/613/1*JXfaeDWbwurv3vrX3iseSw.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
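&lt;p&gt;Both metrics are short enough to compute by hand. The sketch below assumes the standard definitions (R-squared as one minus the ratio of residual to total sum of squares, and RMSE as the square root of the mean squared error); the helper names and the numbers are hypothetical.&lt;/p&gt;

```python
import math

def r_squared(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares around the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def rmse(actual, predicted):
    """Square root of the mean squared prediction error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical observed values and model predictions.
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 16.0]

r2 = r_squared(actual, predicted)
error = rmse(actual, predicted)
print(r2, error)
```

&lt;p&gt;A model that always predicts the mean gets an R-squared of 0 under this definition, and a model that does worse than the mean gets a negative value, which can happen on test data.&lt;/p&gt;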
&lt;h1&gt;Linear regression in python:&lt;/h1&gt;
&lt;p&gt;To apply what we learned we’ll use a Python machine learning library called scikit-learn (imported as sklearn) and an automobile dataset. The problem is to predict an automobile’s price based on its characteristics. The data looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*OLzEqZ8c5gFdH66DJtdoxw.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;The cleaning part is already done, so we’re going to test the models directly. We’ll start with a simple linear regression model. We’ll split the data into training and test sets, with 80% of the data for training and 20% for testing, and check our R-squared score on the training set.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; sklearn&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;model_selection &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; train_test_split
X &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; auto_data&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;drop&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;price&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; axis&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  
Y &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; auto_data&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;price&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;  
X_train&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; x_test&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; Y_train&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; y_test &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; train_test_split&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;X&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; Y&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; test_size&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0.2&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; random_state&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; sklearn&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;linear_model &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; LinearRegression
linear_model &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; LinearRegression&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  
linear_model&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;fit&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;X_train&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; Y_train&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  
&lt;span class=&quot;token comment&quot;&gt;#Checking the score&lt;/span&gt;
linear_model&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;score&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;X_train&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; Y_train&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# OUT: 0.96792273709243304&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We got a really high score on the training set. What about the test set?&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;y_predict &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; linear_model&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;predict&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;x_test&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  
&lt;span class=&quot;token operator&quot;&gt;%&lt;/span&gt;pylab inline  
pylab&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;rcParams&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;figure.figsize&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;plot&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;y_predict&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; label&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Predicted&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;plot&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;y_test&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;values&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; label&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Actual&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;ylabel&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Price&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;legend&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;show&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*q2X7XDIGeiBX8YgXTtvM-A.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;It doesn’t look that good graphically. Let’s check the score:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;r_squared &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; linear_model&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;score&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;x_test&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; y_test&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  
r_squared
&lt;span class=&quot;token comment&quot;&gt;# OUT: 0.63225834161155436&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We’ve got a much lower score on the test set. In ML terms this is known as over-fitting: the model learned the training set so well that it struggles to generalize. So how can we remedy this problem? There’s another form of regression that attempts to solve this issue, called &lt;strong&gt;Lasso Regression&lt;/strong&gt;. Instead of minimizing only the sum of squared errors, it adds a penalty term on the absolute values of the coefficients to force them to be small. Concretely, the algorithm will minimize this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*8c2QXIzRUcV00F39zc4d6w.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
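&lt;p&gt;Where α is a parameter we choose. Let’s try it out with an α of 0.5. (The normalize argument used here was removed in scikit-learn 1.2; on recent versions, scale the features in a separate preprocessing step instead.)&lt;/p&gt;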
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; sklearn&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;linear_model &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; Lasso
lasso_model &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; Lasso&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;alpha&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; normalize&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  
lasso_model&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;fit&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;X_train&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; Y_train&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  
lasso_model&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;score&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;X_train&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; Y_train&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  
&lt;span class=&quot;token comment&quot;&gt;# OUT: 0.96510812725275497&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We’ve got a slightly lower score on the training set. Let’s try the model on the test set:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;y_predict &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; lasso_model&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;predict&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;x_test&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  
&lt;span class=&quot;token operator&quot;&gt;%&lt;/span&gt;pylab inline  
pylab&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;rcParams&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;figure.figsize&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;plot&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;y_predict&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; label&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Predicted&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;plot&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;y_test&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;values&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; label&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Actual&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;ylabel&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Price&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;legend&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;show&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*3BgNad3wqJKHKstPukDRTg.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;This time it seems to fit better. Let’s check the R-squared value:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;r_square &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; lasso_model&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;score&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;x_test&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; y_test&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  
r_square  
&lt;span class=&quot;token comment&quot;&gt;# OUT: 0.887194953444848&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The R-squared score on the test set is way better than with the simple linear model. We can further improve the performance by tweaking the α parameter. Finding the best parameters for a model is called hyper-parameter tuning, and there are tools in sklearn, such as GridSearchCV, that make it easy to find them.&lt;/p&gt;
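&lt;p&gt;To get a feel for what α does, here is a rough sketch of the quantity Lasso minimizes, written in plain Python. It assumes the form given in the scikit-learn documentation (squared error divided by twice the number of samples, plus α times the sum of absolute coefficients); the lasso_objective name and the toy data are made up.&lt;/p&gt;

```python
def lasso_objective(X, y, coefs, intercept, alpha):
    """Squared error over 2n, plus alpha times the L1 norm of the coefficients."""
    n = len(y)
    squared_error = 0.0
    for row, target in zip(X, y):
        prediction = intercept + sum(c * v for c, v in zip(coefs, row))
        squared_error += (target - prediction) ** 2
    l1_penalty = sum(abs(c) for c in coefs)
    return squared_error / (2 * n) + alpha * l1_penalty

# Hypothetical data with two features; the targets follow y = 2 * x1 exactly.
X = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
y = [2.0, 4.0, 6.0]

exact = [2.0, 0.0]    # fits the data perfectly
shrunk = [1.5, 0.0]   # fits slightly worse, but has a smaller coefficient

for alpha in (0.5, 5.0):
    print(alpha,
          lasso_objective(X, y, exact, 0.0, alpha),
          lasso_objective(X, y, shrunk, 0.0, alpha))
```

&lt;p&gt;With a small α the perfect fit has the lower objective, while with a large α the shrunk coefficients win even though they fit worse. That trade-off is what tuning α controls.&lt;/p&gt;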
&lt;h1&gt;Conclusion:&lt;/h1&gt;
&lt;p&gt;In this article we’ve covered how linear regression works, some ways to assess its performance, the over-fitting problem and one solution to overcome it. I hope this was of great use to you. In the next article we’ll tackle another algorithm: logistic regression.&lt;/p&gt;
&lt;p&gt;If you liked this article, be sure to click ❤ below to recommend it and if you have any questions,  &lt;strong&gt;leave a comment&lt;/strong&gt;  and I will do my best to answer.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Train Price Trends]]></title><description><![CDATA[The basics: In this article we’ll try to uncover how linear regression works. The best way to understand it is through example. Suppose we…]]></description><link>https://www.tariqmassaoudi.com/train-price-trends/</link><guid isPermaLink="false">https://www.tariqmassaoudi.com/train-price-trends/</guid><pubDate>Fri, 08 Jun 2018 22:12:03 GMT</pubDate><content:encoded>&lt;h1&gt;The basics:&lt;/h1&gt;
&lt;p&gt;In this article we’ll try to uncover how linear regression works. The best way to understand it is through an example. Suppose we are trying to predict a student’s grade given how many classes the student missed. With enough data points we’ll end up with a graph that looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*gcafDoOHrkd5Xr-qpWS_ew.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Doing a linear regression means finding the line that is closest to all the data points. In mathematics, the equation of a line is y=ax+b, where “a” is the slope and “b” is the intercept. So to find this line we have to find the best “a” and “b” coefficients. But how do we do that, and what does the “best line” mean concretely?&lt;/p&gt;
&lt;h1&gt;Least squares regression:&lt;/h1&gt;
&lt;p&gt;The best line is the one that is closest to all data points. In other terms, it’s the line that minimizes the sum of the squared vertical distances between each point and the fitted line. We can see these distances visually below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*pXTkGi4y1wPcPfo1i8xVtA.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;To calculate this error, we take the difference between each observed value and the value predicted by the line, square it, and sum the results over all data points. Mathematically it looks like:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/369/1*XiLqraVva35_nHEKMhDrBQ.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
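&lt;p&gt;Using calculus, we can find the parameters “a” and “b” of the best line by setting the derivatives of this error with respect to “a” and “b” to zero and solving the resulting equations.&lt;/p&gt;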
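&lt;p&gt;As a minimal sketch for the single-feature case: setting the derivatives of the squared error with respect to “a” and “b” to zero gives a = covariance(x, y) / variance(x) and b = mean(y) - a * mean(x). The function name and the attendance data below are made up for illustration:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;def fit_line(xs, ys):
    # Closed-form least squares fit for y = a*x + b with one feature
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = cov_xy / var_x
    # Intercept: the fitted line always passes through the point of means
    b = mean_y - a * mean_x
    return a, b

# Hypothetical data: number of missed classes and the resulting grade
absences = [0, 1, 2, 3, 4]
grades = [18, 16, 14, 12, 10]
a, b = fit_line(absences, grades)
print(a, b)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Because the made-up points lie exactly on a line, this prints -2.0 18.0: each missed class costs two grade points, starting from a grade of 18.&lt;/p&gt;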
&lt;p&gt;A good thing about linear regression is that it generalizes easily to problems of higher dimension. It’s a matter of adding more terms to the equation and calculating more coefficients. A general model looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*twPf-JqkR_vntaMRvoo5cQ.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h1&gt;Model evaluation: R squared:&lt;/h1&gt;
&lt;p&gt;How do we determine how well the model fits the data? One way is to calculate the R² coefficient, also called the coefficient of determination.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*pwZOTsK4Av51E7-h2KbJww.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
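&lt;p&gt;For a least squares model with an intercept, R-squared measured on the training data is always between 0 and 100% (on unseen data it can drop below 0 when the model does worse than always predicting the mean):&lt;/p&gt;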
&lt;ul class=&quot;list-disc&quot;&gt;
&lt;li&gt;0% indicates that the model explains none of the variability of the response data around its mean.&lt;/li&gt;
&lt;li&gt;100% indicates that the model explains all the variability of the response data around its mean.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In simple terms, R-squared measures how much better our model is than a baseline model that always predicts the mean value of the data. Generally, higher values of R-squared are more desirable. We can measure R-squared on the data we used for training, but that doesn’t reflect how well the model will perform in real life, so a good idea is to split the data into training and test sets and calculate R-squared for both. Generally we’ll observe that the model performs better on the training data. Another way to assess the model’s performance is through the &lt;strong&gt;root mean squared error&lt;/strong&gt;, which tells you how concentrated the data is around the regression line. The lower this error, the better the model. We can calculate it with the formula:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/613/1*JXfaeDWbwurv3vrX3iseSw.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
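&lt;p&gt;Both metrics are short enough to write by hand. The sketch below assumes the common definitions: R-squared as 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean, and the root mean squared error as the square root of the average squared error over all n points. The function names and the numbers are made up for illustration:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;import math

def r_squared(actual, predicted):
    # 1 - (residual sum of squares / total sum of squares around the mean)
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

def rmse(actual, predicted):
    # Square root of the average squared prediction error
    squared_errors = [(y - p) ** 2 for y, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

# Hypothetical observed values and model predictions
actual = [10, 12, 14, 16, 18]
predicted = [11, 12, 13, 16, 18]
print(r_squared(actual, predicted))
print(rmse(actual, predicted))&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here the residual sum of squares is 2 and the total sum of squares is 40, so R-squared comes out at roughly 0.95 and the root mean squared error at roughly 0.63. A model that always predicted the mean (14) would score an R-squared of 0.&lt;/p&gt;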
&lt;h1&gt;Linear regression in Python:&lt;/h1&gt;
&lt;p&gt;To apply what we learned, we’ll be using a Python machine learning library called scikit-learn (imported as sklearn), and the dataset we’re going to use contains automobile data. The problem is to predict an automobile’s price based on its characteristics. The data looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*OLzEqZ8c5gFdH66DJtdoxw.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;The cleaning part is already done, so we’re going to test the models directly. We’ll start with a simple linear regression model. We’ll split the data into train and test sets, with 80% of the data for training and 20% for testing, and we’ll check our R-squared score on the training set.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Separate the features from the target column
X = auto_data.drop(&apos;price&apos;, axis=1)
Y = auto_data[&apos;price&apos;]
# Hold out 20% of the rows for testing
X_train, x_test, Y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

linear_model = LinearRegression()
linear_model.fit(X_train, Y_train)
# Checking the score (R-squared) on the training set
linear_model.score(X_train, Y_train)
# OUT: 0.96792273709243304&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We got a really high score on the training set. What about the test set?&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;import matplotlib.pyplot as plt

y_predict = linear_model.predict(x_test)

# In a Jupyter notebook, run %matplotlib inline first to show the plot inline
plt.rcParams[&apos;figure.figsize&apos;] = (15, 6)
plt.plot(y_predict, label=&apos;Predicted&apos;)
plt.plot(y_test.values, label=&apos;Actual&apos;)
plt.ylabel(&apos;Price&apos;)
plt.legend()
plt.show()&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*q2X7XDIGeiBX8YgXTtvM-A.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;It doesn’t look that good graphically. Let’s check the score:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;r_squared = linear_model.score(x_test, y_test)
r_squared
# OUT: 0.63225834161155436&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We’ve got a much lower score. This is known in ML terms as over-fitting: the model learned the training set so well that it struggles to generalize. So how can we remedy this problem? There’s another form of regression that attempts to solve this issue, called &lt;strong&gt;Lasso Regression&lt;/strong&gt;. Instead of minimizing only the sum of the squared errors, it adds a penalty term on the coefficients to force them to be small. Concretely, the algorithm minimizes this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*8c2QXIzRUcV00F39zc4d6w.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Here α is a parameter we choose: the larger it is, the more strongly large coefficients are penalized. Let’s try it out with an α of 0.5:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;from sklearn.linear_model import Lasso

# Note: the normalize argument was removed in scikit-learn 1.2; on newer
# versions, scale the features in a Pipeline with StandardScaler instead
lasso_model = Lasso(alpha=0.5, normalize=True)
lasso_model.fit(X_train, Y_train)
lasso_model.score(X_train, Y_train)
# OUT: 0.96510812725275497&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We’ve got a slightly lower score on the training set. Let’s try the model on the test set:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;import matplotlib.pyplot as plt

y_predict = lasso_model.predict(x_test)

# In a Jupyter notebook, run %matplotlib inline first to show the plot inline
plt.rcParams[&apos;figure.figsize&apos;] = (15, 6)
plt.plot(y_predict, label=&apos;Predicted&apos;)
plt.plot(y_test.values, label=&apos;Actual&apos;)
plt.ylabel(&apos;Price&apos;)
plt.legend()
plt.show()&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/700/1*3BgNad3wqJKHKstPukDRTg.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;This time the fit looks better. Let’s check the R-squared value:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;r_square = lasso_model.score(x_test, y_test)
r_square
# OUT: 0.887194953444848&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The R-squared score is much better than that of the simple linear model. We can further improve performance by tweaking the α parameter. Finding the best parameters for a model is called hyper-parameter tuning, and sklearn provides functions that make this search easy.&lt;/p&gt;
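&lt;p&gt;One such function is GridSearchCV, which tries every candidate value with cross-validation and keeps the best one. The sketch below is only an illustration: it uses synthetic data from make_regression as a stand-in for the automobile dataset, and the list of α values is an arbitrary choice, so the numbers it prints say nothing about the results above:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the automobile data (an assumption for this sketch)
X, Y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X_train, x_test, Y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# Candidate values for alpha; the grid itself is an arbitrary choice
param_grid = dict(alpha=[0.01, 0.1, 0.5, 1.0, 5.0, 10.0])

# 5-fold cross-validation on the training set only; with no scoring argument,
# the search uses the estimator score, which is R-squared for Lasso
search = GridSearchCV(Lasso(max_iter=10000), param_grid, cv=5)
search.fit(X_train, Y_train)

# Best alpha found, and R-squared of the refitted best model on the test set
print(search.best_params_)
print(search.score(x_test, y_test))&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The test set is kept out of the search on purpose: α is chosen using the training data alone, so the final test score remains an honest estimate of how the tuned model generalizes.&lt;/p&gt;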
&lt;h1&gt;Conclusion:&lt;/h1&gt;
&lt;p&gt;In this article we’ve covered how linear regression works, some ways to assess its performance, the over-fitting problem and one solution to overcome it. I hope this was useful to you. In the next article we’ll tackle another algorithm: logistic regression.&lt;/p&gt;
&lt;p&gt;If you liked this article, be sure to click ❤ below to recommend it and if you have any questions,  &lt;strong&gt;leave a comment&lt;/strong&gt;  and I will do my best to answer.&lt;/p&gt;</content:encoded></item></channel></rss>