overcoming the not-insignificant learning curve in using SQL effectively, particularly in specialist applications such as data analysis (as opposed to creating simple reports of simple data).database domain knowledge (as an end user, developer and/or administrator the suggestion that "SQL is faster" I often see is a massive over-simplification), and.working with data in an RDBMS* which may or may not be appropriate for the workload, even if it's just a small SQLite database,.SQL is a domain-specific language for querying relational data (usually in an relational database management system which SQLite, MySQL, Oracle, SQL Server, PostgreSQL etc. Pandas is a data analysis toolkit implemented in Python, a general purpose programming language. Tl dr: It's often down to the use-case, convenience, or a gap in knowledge around the extent of SQL's capabilities.Īs much as there is overlap in the application of these two things, this is comparing apples to oranges. But after that there tends to be a pretty significant gulf in knowledge, until you get to the experts (DBA, Data Engineers, etc.). Some may pick up some of the more advance grouping and iteration syntax. Every beginner learns the 'extraction syntax' of SQL (SELECT, FROM, WHERE, etc.) as a means to get your data from a DB to the next place. The other is, quite simply, that all too many users don't know the extent of SQL's capabilities. From what I've seen the reason why many users, even in these cases, don't go via SQL is two-fold.įirstly, the major advantage pandas has over SQL is that it's part of the wider Python universe, which means in one fell swoop I can load, clean, manipulate, and visualize my data (I can even execute SQL through Pandas.). If you want to do many, repetitive data manipulation tasks and persist the outputs, I'd always recommend trying to go via SQL first.
However, considering cases where the use-case may justify using either Pandas or SQL, you're certainly not wrong. In these cases loading, storing, manipulating and extracting from a database is not viable. One simple reason why you may see a lot more questions around Pandas data manipulation as opposed to SQL is that to use SQL, by definition, means using a database, and a lot of use-cases these days quite simply require bits of data for 'one-and-done' tasks (from. You can probably have many technical discussions around this, but I'm considering the user perspective below. Pandas can solve this but is missing some things when it comes to truly big data or in particular partitions (perhaps improved recently).ĭataFrames should be viewed as a high-level API to SQL routines, even if with pandas they are not at all rendered to some SQL planner.
You can naturally run things line by line in a repl (even in Spark) and view the results.Ĭonsider the example, of adding a new transformed (string mangled column) to a table, then grouping by it and doing some aggregations. The pattern of writing nested routines, commenting them out to check them, and then uncommenting them is replaced by single lines of transformation. The main reason is that DataFrame abstractions allow you to construct SQL statements whilst avoiding verbose and illegible nesting. TLDR SQL is not geared around the (human) development and debugging process, DataFrames are. The real first question is why are people more productive with DataFrame abstractions than pure SQL abstractions.