Feature engineering
With the data loaded, we can think of how we might want to train our model. One possible use case is to create a model that can tell categorize books based on their details.
The Goodreads data does not include categories exactly, but has something similar in popular_shelves
. These are free text tags that users can associate with books. Looking at a book, you can see how often certain shelves are used:
select popular_shelves from graphic_novels limit 5;
[{'count': 228, 'name': to-read}, {'count': 2, 'name': graphic-novels}, {'count': 1, 'name': ff-re-…`
[{'count': 2, 'name': bd}, {'count': 2, 'name': to-read}, {'count': 1, 'name': french-author}, {'co…`
[{'count': 493, 'name': to-read}, {'count': 113, 'name': graphic-novels}, {'count': 102, 'name': co…`
[{'count': 222, 'name': to-read}, {'count': 9, 'name': currently-reading}, {'count': 3, 'name': mil…`
[{'count': 20, 'name': to-read}, {'count': 8, 'name': comics}, {'count': 4, 'name': graphic-novel},…`
Parsing the data out by unpacking and aggregating this field, we can see the most popular shelves:
select
shelf.name as category,
sum(cast(shelf.count as integer)) as category_count
from (
select
unnest(popular_shelves) as shelf
from graphic_novels
)
group by 1
order by 2 desc
limit 15;
category | category_count |
---|---|
to-read | 87252 |
comics | 76283 |
graphic-novels | 67923 |
graphic-novel | 58219 |
currently-reading | 57252 |
fiction | 50014 |
owned | 48936 |
favorites | 47256 |
comic | 46948 |
comics-graphic-novels | 38433 |
fantasy | 37003 |
comic-books | 36638 |
default | 35292 |
books-i-own | 34620 |
library | 31378 |
A lot of these shelves would be hard to use for modeling (such as owned
or default
). But genres such as fantasy
could be interesting. If we continued looking through shelves, these are the most popular genres:
CATEGORIES = [
"fantasy", "horror", "humor", "adventure",
"action", "romance", "ya", "superheroes",
"comedy", "mystery", "supernatural", "drama",
]
Using these categories, we can construct a table of the most common genres and select the single best genre for each book (assuming it was shelved that way at least three times). We can then wrap that query in an asset and materialize it as a table alongside our other DuckDB tables:
@dg.asset(
kinds={"duckdb"},
description="Goodreads shelf feature engineering",
group_name="feature_engineering",
)
def book_category(
duckdb_resource: dg_duckdb.DuckDBResource,
graphic_novels,
):
sql_categories = ", ".join([f"'{s}'" for s in constants.CATEGORIES])
query = f"""
create table if not exists book_category as (
select
book_id,
category
from (
select
book_id,
category,
category_count,
row_number() over (partition by book_id order by category_count desc) as category_rank
from (
select
book_id,
shelf.name as category,
cast(shelf.count as integer) as category_count
from (
select
book_id,
unnest(popular_shelves) as shelf
from graphic_novels
)
where category in ({sql_categories})
and category_count > 3
)
)
where category_rank = 1
);
"""
with duckdb_resource.get_connection() as conn:
conn.execute(query)
Enrichment table
With our book_category
asset created, we can combine that with the author
and graphic_novel
assets to create our final data set we will use for modeling. Here we will both create the table within DuckDB and select its contents into a DataFrame, which we can pass to our next series of assets:
@dg.asset(
kinds={"duckdb"},
description="Combined book, author and shelf data",
group_name="processing",
)
def enriched_graphic_novels(
duckdb_resource: dg_duckdb.DuckDBResource,
graphic_novels,
authors,
book_category,
) -> pd.DataFrame:
query = """
select
book.title as title,
authors.name as author,
book.description as description,
book_category.category
from graphic_novels as book
left join authors
on book.authors[1].author_id = authors.author_id
left join book_category
on book.book_id = book_category.book_id
where nullif(book.description, '') is not null
and category is not null
"""
with duckdb_resource.get_connection() as conn:
return conn.execute(query).fetch_df()
Next steps
- Continue this tutorial with file creation